
Project funded by the European Union's Horizon 2020 Research and Innovation Programme (2014–2020)

Support Action

Big Data Europe – Empowering Communities with Data Technologies

Project Number: 644564    Start Date of Project: 01/01/2015    Duration: 36 months

Deliverable 5.4

Domain-Specific Big Data Integrator Instances II

Dissemination Level: Public

Due Date of Deliverable: Month 24, 01/01/2017

Actual Submission Date: 24/02/2017

Work Package: WP5, Big Data Integrator Instances

Task: T5.2

Type: Other

Approval Status: Approved

Version: 1.00

Number of Pages: 41

Filename: D5.4_Domain-Specific Big Data Integrator Instances II

Abstract: Documentation of the Big Data Integrator Instances deployed for executing the pilots of the Big Data Integrator across all seven societal challenges.

The information in this document reflects only the author's views and the European Community is not liable for any use that may be made of the information contained therein. The information in this document is provided "as is" without guarantee or warranty of any kind, express or implied, including but not limited to the fitness of the information for a particular purpose. The user thereof uses the information at his/her sole risk and liability.

Ref. Ares(2017)1004145 - 24/02/2017


History

Version  Date        Reason                                                         Revised by
0.00     21/01/2016  Document structure                                             S. Konstantopoulos
0.01     02/12/2016  First draft based on descriptions of second piloting cycle     A. Charalambidis, S. Konstantopoulos, G. Stavrinos and I. Klampanos
0.02     12/01/2017  Second draft based on the feedback of the piloting partners    A. Charalambidis, S. Konstantopoulos and V. Karkaletsis
0.03     12/01/2017  Corrections and comments for SC2 pilot                         P. Karampiperis and P. Zervas
0.04     13/01/2017  Corrections and comments for SC7 pilot                         G. Papadakis, M. Lazarrini
0.05     22/01/2017  Corrections for SC1 pilot                                      B. Williams-Jones, K. McNeice
0.06     25/01/2017  SC3 pilot architecture                                         A. Charalambidis and I. Mouchakis
0.07     01/02/2017  Corrections and clarifications for SC3                         A. Charalambidis and F. Mouzakis
0.08     09/02/2017  Peer review                                                    A. Versteden and H. Jabeen
0.09     16/02/2017  Address peer review comments                                   A. Charalambidis, S. Konstantopoulos, I. Klampanos, J. Jakobitsch and K. McNeice
1.00     23/02/2017  Final version


Author List

Organisation  Name                      Contact Information
NCSR-D        Ioannis Mouchakis         gmouchakis@iit.demokritos.gr
NCSR-D        Stasinos Konstantopoulos  konstant@iit.demokritos.gr
NCSR-D        Angelos Charalambidis     acharal@iit.demokritos.gr
NCSR-D        Georgios Stavrinos        gstavrinos@iit.demokritos.gr
NCSR-D        Vangelis Karkaletsis      vangelis@iit.demokritos.gr
Agroknow      Pythagoras Karampiperis   pythk@agroknow.com
Agroknow      Panagiotis Zervas         pzervas@agroknow.com
Open PHACTS   Bryn Williams-Jones       bryn@openphactsfoundation.org
Open PHACTS   Kiera McNeice             kiera@openphactsfoundation.org
CRES          Fragkiskos Mouzakis       mouzakis@cres.gr
TenForce      Aad Versteden             aad.versteden@tenforce.com
UoB           Hajira Jabeen             hajirajabeen@gmail.com
SWC           Jürgen Jakobitsch         j.jakobitsch@semantic-web.at


Executive Summary

This report documents the instantiations of the Big Data Integrator Platform that underlie the pilot applications that will be prepared in WP6 for serving exemplary use cases of the Horizon 2020 Societal Challenges. These platform instances will be provided to the relevant networking partners to be used for executing the pilot sessions foreseen in WP6.

For each of the seven pilots, this document provides (a) a brief summary of the pilot description prepared in WP6, and especially of the use cases provided in the pilot descriptions, (b) the technical requirements for carrying out these use cases, (c) an architecture that shows the BDI components required to cover these requirements, and (d) the list of components in the architecture and their status (available as part of BDI, otherwise available, or to be developed as part of the pilot).


Abbreviations and Acronyms

BDI: The Big Data Integrator platform that is developed within Big Data Europe. The components that are made available to the pilots by BDI are listed here: https://github.com/big-data-europe/README/wiki/Components
BDI Instance: A specific deployment of BDI, complemented by tools specifically supporting a given Big Data Europe pilot
BT: Bluetooth
ECMWF: European Centre for Medium range Weather Forecasting
ESGF: Earth System Grid Federation
FCD: Floating Car Data
LOD: Linked Open Data
SC1: Societal Challenge 1, Health, Demographic Change and Wellbeing
SC2: Societal Challenge 2, Food Security, Sustainable Agriculture and Forestry, Marine, Maritime and Inland Water Research and the Bioeconomy
SC3: Societal Challenge 3, Secure, Clean and Efficient Energy
SC4: Societal Challenge 4, Smart, Green and Integrated Transport
SC5: Societal Challenge 5, Climate Action, Environment, Resource Efficiency and Raw Materials
SC6: Societal Challenge 6, Europe in a changing world – Inclusive, innovative and reflective societies
SC7: Societal Challenge 7, Secure societies – Protecting freedom and security of Europe and its citizens
AK: Agroknow, Belgium
CERTH: Centre for Research and Technology, Greece
CRES: Center for Renewable Energy Sources and Saving, Greece
FAO: Food and Agriculture Organization of the United Nations, Italy
FhG: Fraunhofer IAIS, Germany
InfAI: Institute for Applied Informatics, Germany
NCSR-D: National Center for Scientific Research "Demokritos", Greece
OPF: Open PHACTS Foundation, UK
SWC: Semantic Web Company, Austria
UoA: National and Kapodistrian University of Athens
VU: Vrije Universiteit Amsterdam, the Netherlands


Table of Contents

1 Introduction
  1.1 Purpose and Scope
  1.2 Methodology
2 Second SC1 Pilot Deployment
  2.1 Use Cases
  2.2 Requirements
  2.3 Architecture
  2.4 Deployment
3 Second SC2 Pilot Deployment
  3.1 Overview
  3.2 Requirements
  3.3 Architecture
  3.4 Deployment
4 Second SC3 Pilot Deployment
  4.1 Overview
  4.2 Requirements
  4.3 Architecture
  4.4 Deployment
5 Second SC4 Pilot Deployment
  5.1 Use cases
  5.2 Requirements
  5.3 Architecture
  5.4 Deployment
6 Second SC5 Pilot Deployment
  6.1 Use cases
  6.2 Requirements
  6.3 Architecture
  6.4 Deployment
7 Second SC6 Pilot Deployment
  7.1 Use cases
  7.2 Requirements
  7.3 Architecture
  7.4 Deployment
8 Second SC7 Pilot Deployment
  8.1 Use cases
  8.2 Requirements
  8.3 Architecture
  8.4 Deployment
9 Conclusions

List of Tables

Table 1: Requirements of the Second SC1 Pilot
Table 2: Components needed to deploy the Second SC1 Pilot
Table 3: Requirements of the Second SC2 Pilot
Table 4: Components needed to deploy the Second SC2 Pilot
Table 5: Requirements of the Second SC3 Pilot
Table 6: Components needed to deploy the Second SC3 Pilot
Table 7: Requirements of the Second SC4 Pilot
Table 8: Components needed to deploy the Second SC4 Pilot
Table 9: Requirements of the Second SC5 Pilot
Table 10: Components needed to deploy the Second SC5 Pilot
Table 11: Requirements of the Second SC6 Pilot
Table 12: Components needed to deploy the Second SC6 Pilot
Table 13: Requirements of the Second SC7 Pilot
Table 14: Components needed to deploy the Second SC7 Pilot


List of Figures

Figure 1: Architecture of the Second SC1 Pilot
Figure 2: Architecture of the Second SC2 Pilot
Figure 3: Architecture of the Second SC3 Pilot
Figure 4: Architecture of the Second SC4 Pilot
Figure 5: Architecture of the Second SC5 Pilot
Figure 6: Architecture of the Second SC6 Pilot
Figure 7: Architecture of the Second SC7 Pilot


1 Introduction

1.1 Purpose and Scope

This report documents the instantiations of the Big Data Integrator Platform (BDI) for serving the needs of the domains examined within Big Data Europe. These platform instances will be provided to the relevant networking partners to execute the pilots foreseen in WP6.

1.2 Methodology

Task 5.2 focuses on the application of the generic instantiation methodology in a specific use case pertaining to domains closely related to Europe's Societal Challenges. To this end, T5.2 comprises seven (7) distinct sub-tasks, each one dedicated to a different domain of application.

Participating partners and their role: NCSR-D (task leader) deploys the different instantiations of the Big Data Integrator Platform and supports the partners carrying out each pilot with consulting about the platform. This task includes two phases: the design phase and the deployment phase. The design phase involves the following:

- Review the pilot descriptions prepared in WP6 and request clarifications where needed, in order to prepare a detailed technical description of the platform that will support the pilot.
- Prepare a first draft of the sections for the second-cycle pilots, where use cases and workflows from the pilot descriptions are summarized, and technical requirements and an architecture for each pilot-specific platform are drafted.
- Cooperate with the persons responsible for each pilot to update the pilot description and the technical description in this deliverable, so that they are consistent and satisfactory. This draft also includes a list of components and their availability: (a) base platform components that are prepared in WP4, (b) pilot-specific components that are already available, or (c) pilot-specific components that will be developed for the pilot. Components are also assigned a partner responsible for their implementation.
- Review the pilot technical descriptions from the perspective of bridging between technical work and the community requirements, to establish that the pilot is relevant to the communities it is aimed at.

During the deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.


2 Second SC1 Pilot Deployment

2.1 Use Cases

The pilot is carried out by OPF and VU in the frame of SC1: Health, Demographic Change and Wellbeing.

The pilot demonstrates the workflow of reproducing the functionality of an existing data integration and processing system (the Open PHACTS Discovery Platform) on BDI. The second pilot extends the first pilot (cf. D5.2, Section 2) with the following:

- Discussions with stakeholders and other Societal Challenges will identify how the existing Open PHACTS platform and datasets may potentially be used to answer queries in other domains. In particular, applications in Societal Challenge 2 (food security and sustainable agriculture), where the effects of chemistry (e.g. pesticides) on biology are probed in plants, could exploit the linked data services currently within the OPF platform. This will require discussing use case specifics with SC2 to understand their requirements and ensure that the OPF data is applicable. Similarly, we will explore whether SC2 data that is relevant for early biology research could be linked to the OPF data platform.
- No specific new datasets are targeted for integration in the second pilot. However, if datasets to be made available through other pilots have clear potential links to Open PHACTS datasets, these will be considered for integration into the platform, to offer researchers the ability to pose more complex queries across a wider range of data.
- The second pilot will aim to expand on the first pilot by refreshing the datasets integrated into the pilot. Homogenising and integrating the new data available for these datasets, and developing ways to update datasets by integrating new data on an ongoing basis, will enable new use cases where researchers require fully current datasets for their queries.
- The second pilot will also simplify existing workflows for querying the API, for example with components for common software tools such as KNIME, reducing the barrier for academic institutions and companies to access the platform for knowledge- and data-driven biomedical research use cases.

2.2 Requirements

Table 1 lists the ingestion, storage, processing and output requirements set by this pilot.

Table 1: Requirements of the Second SC1 Pilot

R1. The solution should be packaged in a way such that it is possible to combine the Open PHACTS Docker and the BDE platform to achieve a custom integrated solution.
    Comment: Specificities of the services of the Open PHACTS Discovery Platform should not be hard-wired into the domain-specific instance, but should be read from a configuration file (such as SWAGGER). The BDE instance should offer or apply these external services over data hosted by the BDE instance.

R2. RDF data storage.
    Comment: The current Open PHACTS Discovery Platform is based on distributed Virtuoso, a proprietary solution. The BDE platform will provide a distributed 4store and SANSA, to be compared with the Open PHACTS Discovery Platform.

R3. Datasets are aligned and linked at data ingestion time, and the transformed data is stored.
    Comment: In conjunction with R1, a modular data ingestion component should dynamically decide which data transformers to invoke.

R4. Data and query security and privacy requirements.
    Comment: A BDI local deployment holds private data and serves private queries. BDE does not foresee any specific technical support for query obfuscation, so remote data sources need to be cloned locally to guarantee query privacy.


Figure 1: Architecture of the Second SC1 Pilot

2.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures:
- Distributed triple store for the data. The second pilot cycle will also test the feasibility of using the SANSA stack (http://sansa-stack.net) as an alternative for SPARQL query processing.

Processing infrastructures:
- Scientific Lenses query expansion.

Other modules:
- Data connector, including the data transformation modules for the alignment of data at ingestion time.
- REST API for querying that builds a SPARQL query by using keywords to fill in pre-defined query templates (a sketch of this pattern follows below). The querying service also uses Scientific Lenses to expand queries.
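The following is a minimal sketch of the keyword-to-SPARQL templating idea behind the query REST API. The template, endpoint URL and parameter names are illustrative assumptions only, not the pilot's actual Open PHACTS or BDE interfaces.

```python
# Sketch only: template, endpoint and parameters are assumptions, not the pilot API.
from string import Template
import requests

# A pre-defined query template; ${keyword} is filled in from the user's keyword.
TEMPLATE = Template("""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?s ?p ?o WHERE {
  ?s rdfs:label ?label ; ?p ?o .
  FILTER(CONTAINS(LCASE(STR(?label)), LCASE("${keyword}")))
} LIMIT 100
""")

def query_by_keyword(keyword, endpoint="http://localhost:8890/sparql"):
    """Fill a pre-defined template with a keyword and run it on a SPARQL endpoint."""
    sparql = TEMPLATE.substitute(keyword=keyword)
    resp = requests.get(endpoint, params={
        "query": sparql,
        "format": "application/sparql-results+json",
    })
    resp.raise_for_status()
    return resp.json()["results"]["bindings"]

if __name__ == "__main__":
    for binding in query_by_keyword("aspirin"):
        print(binding)
```

In the pilot itself, the set of templates and the endpoint would be read from the SWAGGER-style configuration required by R1 rather than being hard-coded.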

2.4 Deployment

Table 2 lists the components provided to the pilot as part of BDI, and components that will be developed within WP6 in the context of executing the pilot.


Table 2: Components needed to deploy the Second SC1 Pilot

Module | Task | Responsible
4store | BDI dockers made available by WP4 | NCSR-D
SANSA stack | BDI dockers made available by WP4 | FhG/UniBonn
Data connector and transformation modules | Develop a dynamic transformation engine that uses SWAGGER descriptions to select the appropriate transformer | VU
Query endpoint | Develop a dynamic query re-write engine that uses SWAGGER descriptions to select the transformer | VU
Scientific Lenses query expansion module | Needs to be deployed and tested, unless an existing live service will be used for the BDE pilot | VU


3 Second SC2 Pilot Deployment

3.1 Overview

The pilot is carried out by AK, FAO and SWC in the frame of SC2: Food Security, Sustainable Agriculture and Forestry, Marine, Maritime and Inland Water Research and the Bioeconomy.

The second pilot cycle builds upon the first pilot cycle (cf. D5.1, Section 3), expanding the relevant data sources and extending the data processing needed to handle a variety of data types (apart from bibliographic data) relevant to viticulture.

The pilot demonstrates the following workflows:

1. Text mining workflow: Automatically annotating scientific publications by (a) extracting named entities (locations, domain terms) and (b) extracting the captions of images, figures and tables. The extracted information is provided to viticultural researchers via a GUI that exposes search functionality.
2. Data processing workflow: The end users (viticultural researchers) upload scientific data in a variety of formats and provide the metadata needed in order to correctly interpret the data. The data is ingested and homogenized so that it can be compared and connected with other relevant data originally in diverse formats. The data is exposed to viticultural researchers via a GUI that exposes search/discovery, aggregation, analysis, correlation and visualization functionalities over structured data. The results of the data analysis will be stored in the infrastructure, to avoid carrying out the same processing multiple times, with appropriate provenance for future reference, publication and scientific replication.
3. Phenologic modelling workflow, that is, the scheduling of agricultural operations (e.g. pruning, harvesting, etc.) by cross-examining the weather data observed in the area of the vineyard with the appropriate weather conditions needed for the aforementioned operations.
4. Variety identification workflow: The end users complete an on-spot questionnaire regarding the characteristics of a specific grape variety. Together with the geolocation of the questionnaire, this information is used to identify a grape variety.

The following datasets will be involved:

- The AGRIS and PubMed datasets that include scientific publications.
- Weather data available via publicly available APIs such as AccuWeather, OpenWeatherMap and Weather Underground.
- User-generated data, such as geotagged photos of leaves, young shoots and grape clusters, ampelographic data, and SSR-marker data that will be provided by the VITIS application.
- The OIV Descriptor List for Grape Varieties and Vitis species (http://www.oiv.int/en).
- Crop Ontology.

The following processing is carried out:

- Named entity extraction.
- Researcher affiliation extraction and verification.
- Variety identification.
- Phenologic modelling.
- PDF structure processing to associate tables and diagrams with captions.
- Data ingestion and homogenization.
- Aggregation, analysis and correlation over scientific data.

The following outputs are made available for visualization or further processing:

- Structured information and topics extracted from scientific publications.
- Metadata for dataset searching and discovery.
- Aggregation, analysis and correlation results.

3.2 Requirements

Table 3 lists the ingestion, storage, processing and output requirements set by this pilot.

Table 3: Requirements of the Second SC2 Pilot

R1. In case of failures during data processing, users should be able to recover data produced earlier in the workflow, along with their lineage and associated metadata.
    Comment: Processing modules should periodically store intermediate results and their lineage metadata. When starting up, processing modules should check at the metadata registry if intermediate results are available. (A sketch of this pattern follows the table.)

R2. Extracting images and their captions from scientific publications.
    Comment: To be developed for the pilot, taking into account R1.

R3. Extracting thematic annotations from text in scientific publications.
    Comment: To be developed for the pilot, taking into account R1.

R4. Extracting researcher affiliations from the scientific publications.
    Comment: To be developed for the pilot, taking into account R1.

R5. Variety identification.
    Comment: To be developed for the pilot, taking into account R1.

R6. Phenologic modelling.
    Comment: To be developed for the pilot, taking into account R1.

R7. Expose data and metadata in JSON through a Web API.
    Comment: The data ingestion module should write JSON documents in HDFS; 4store should be accessed via a SPARQL endpoint that responds with results in JSON.
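The following is a minimal sketch of the recovery pattern described in R1: each processing module persists intermediate results together with a lineage sidecar and, on start-up, checks whether a previous run already produced them. The file layout and metadata fields are illustrative assumptions, not the pilot's actual storage schema.

```python
# Sketch of the R1 recovery pattern: persist intermediate results together with
# lineage metadata and reuse them after a failure. Paths and field names are
# illustrative assumptions, not the pilot's actual schema.
import json
import os
import time

def save_intermediate(step, records, workdir="/tmp/pilot-checkpoints"):
    """Store an intermediate result and a lineage sidecar describing how it was made."""
    os.makedirs(workdir, exist_ok=True)
    with open(os.path.join(workdir, f"{step}.json"), "w") as f:
        json.dump(records, f)
    with open(os.path.join(workdir, f"{step}.lineage.json"), "w") as f:
        json.dump({"step": step, "produced_at": time.time(), "count": len(records)}, f)

def load_intermediate(step, workdir="/tmp/pilot-checkpoints"):
    """Return (records, lineage) if a previous run already produced this step, else None."""
    data_path = os.path.join(workdir, f"{step}.json")
    meta_path = os.path.join(workdir, f"{step}.lineage.json")
    if os.path.exists(data_path) and os.path.exists(meta_path):
        with open(data_path) as d, open(meta_path) as m:
            return json.load(d), json.load(m)
    return None

if __name__ == "__main__":
    if load_intermediate("named-entities") is None:
        save_intermediate("named-entities", [{"doc": "agris-001", "entity": "Vitis vinifera"}])
    print(load_intermediate("named-entities"))
```

In the deployed pilot, the same check-then-compute logic would target HDFS and the metadata registry rather than the local filesystem used here for illustration.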


Figure 2: Architecture of the Second SC2 Pilot

3.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures:
- HDFS for storing publication full-text and ingested datasets.
- A graph database for storing publication metadata (terms and named entities), affiliation metadata (connections between researchers), weather metadata and VITIS metadata.

Processing infrastructures:
- Metadata extraction: Spark or UnifiedViews (cf. http://www.unifiedviews.eu) are used to extract RDF metadata from publication full-text. These tools will react on Kafka messages. Spark and UnifiedViews will be evaluated for this task.
- PoolParty: A SKOS thesaurus (cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System) will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g. bio terms, locations and other named entities). For this step the SWC PoolParty Semantic Suite (cf. http://www.poolparty.biz) will be used. Additional enrichment of the dataset will be explored, e.g. via linking to DBpedia or other LOD sources.
- AKSTEM: the process of discovering relations and associations between organizations and people in the field of viticulture research.
- Phenologic modelling: an algorithm already developed in AK VITIS; it will be adapted to work in the context of an Apache Spark application.
- Variety identification: already developed in AK VITIS; it will be adapted to work in the context of an Apache Spark application.
- Extraction of images and figures and their captions from publication PDFs.
- Data analysis, which writes analysis results back into the infrastructure to be retrieved for visualization. Data analysis should accompany each write-back with appropriate metadata that specify the processing lineage of the derived dataset. Intermediate results should also be written out (and described as such in the metadata) in order to allow resuming processing after a failure.

Other modules:
- Flume for publication ingestion: for every source that will be ingested into the system, there will be a Flume agent responsible for data ingestion and basic modification/unification.
- Kafka: as soon as a new record is available, a Kafka message will be produced. One Kafka consumer stores the raw data into HDFS (a sketch follows below).
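The following is a minimal sketch of the "one Kafka consumer stores raw data into HDFS" step. The topic name, broker address, WebHDFS URL and target paths are assumptions for illustration, not the pilot's actual configuration.

```python
# Sketch only: topic, broker, WebHDFS URL and target paths are assumptions, not
# the pilot's actual configuration.
from kafka import KafkaConsumer          # pip install kafka-python
from hdfs import InsecureClient          # pip install hdfs

consumer = KafkaConsumer(
    "publications",                      # hypothetical topic fed by the Flume agents
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda raw: raw.decode("utf-8"),
)
hdfs_client = InsecureClient("http://namenode:50070", user="hdfs")

# Write every raw record to a landing area in HDFS, keyed by topic/partition/offset
# so that re-processing a message never overwrites a different one.
for message in consumer:
    path = f"/landing/publications/{message.topic}-{message.partition}-{message.offset}.json"
    hdfs_client.write(path, data=message.value, encoding="utf-8", overwrite=True)
```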

3.4 Deployment

Table 4 lists the components provided to the pilot as part of BDI, and components that will be developed within WP6 in the context of executing the pilot.

Table 4: Components needed to deploy the Second SC2 Pilot

Module | Task | Responsible
Spark over HDFS, Flume, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, SWC
GraphDB and/or Neo4j dockerization | To be investigated whether the Docker images provided by the official systems (cf. https://neo4j.com/developer/docker) are suitable for the pilot. If not, they will be altered for the pilot, or an already dockerized triple store such as Virtuoso or 4store will be used | SWC
Flume agents for publication ingestion and processing | To be developed for the pilot | SWC
Flume agents for data ingestion | To be extended for the pilot in order to support the introduced datasets (AccuWeather data, user-generated data) | SWC, AK
Data storage schema | To be developed for the pilot | SWC, AK
Phenologic modelling | To be adapted from AK VITIS for the pilot | AK
Spark AKSTEM | To be adapted from AK STEM for the pilot | AK
Variety identification | To be adapted from AK VITIS for the pilot | AK


4 Second SC3 Pilot Deployment

4.1 Overview

The pilot is carried out by CRES in the frame of SC3: Secure, Clean and Efficient Energy.

The second pilot cycle extends the first pilot by adding additional online and offline data analysis on raw data from Acoustic Emissions (AE) sensors and on aggregated data, such as parametrics from continuous monitoring systems (CMS). The pilot demonstrates the following workflow: a developer in the field of wind energy enhances condition monitoring for each unit in a wind farm by pooling together data from multiple units of the same farm (to consider the cluster operation in total) and third-party data (to perform correlated assessment). The custom analysis modules created by the developer use both raw data that is transferred offline to the processing cluster and condensed data that is streamed online, in the same time order in which the events occur.

The following datasets are involved:

- Raw sensor and SCADA data from a given wind farm.
- Online stream data comprised of parametrics and statistics extracted from the raw SCADA data.
- Raw sensor data from the Acoustic Emissions module of a given wind farm.

All data is in custom binary or ASCII formats. ASCII files contain a metadata header and, in tabulated form, the signal data (signals in columns, time sequence in rows). All data is annotated by location, time and system id.

The following processing is carried out:

- Near-real-time execution of parametrized models to return operational statistics and warnings, including correlation analysis of data across units.
- Weekly execution of operational statistics.
- Weekly execution of model parametrization.
- Weekly execution of specific Acoustic Emissions DSP.

The following outputs are made available for visualization or further processing:

- Operational statistics, near-real-time and weekly.
- Model parameters.


4.2 Requirements

Table 5 lists the ingestion, storage, processing and output requirements set by this pilot. Since the second cycle of the pilot extends the first pilot, some requirements are identical and are therefore omitted from Table 5.

Table 5: Requirements of the Second SC3 Pilot

R1. The online data will be sent (via OPC) from the intermediate (local) processing level to BDI.
    Comment: A data connector must be developed that provides for receiving OPC streams from an OPC-compatible server.

R2. The application should be able to recover from short outages by collecting the data transmitted during the outage from the data sources.
    Comment: An OPC data connector must be developed that can retrieve the missing data collected at the intermediate level from the distributed data historian systems.

R3. Near-real-time execution of parametrized models to return operational statistics, including correlation analysis of data across units.
    Comment: The analysis software should write its results back in a specified format and data model that is appropriate input for further analysis.

R4. The GUI supports database querying and data visualization for the analytics results.
    Comment: The GUI will be able to access files in the specified format and data model.


Figure 3: Architecture of the Second SC3 Pilot

4.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures:
- HDFS, which stores binary blobs, each holding a temporal slice of the complete data. The slicing parameters are fixed and can be applied at data ingestion time.
- A Postgres relational database to store the warnings, operational statistics and the output of the analysis. The schema will be defined at a later stage.
- A Kafka broker that will distribute the continuous stream of CMS data to model execution.

Processing infrastructures:
- A processor that operates upon temporal slices of data.
- A Spark module that orchestrates the application of the processor on slices.
- A Spark streaming module that operates on the online data (a sketch follows at the end of this section).

Other modules:
- A data connector that offers an ingestion endpoint and/or can retrieve from remote data sources using the FTP protocol.
- A data connector that offers an ingestion endpoint that can retrieve an online stream using the OPC protocol and publish it to a Kafka topic.
- Data visualization that can visualize the data files stored in HDFS.
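The following is a minimal sketch of how the Spark streaming module could consume the CMS stream from Kafka and compute near-real-time operational statistics over temporal windows. The topic name, message schema and window size are assumptions about the CMS stream, not the pilot's actual configuration.

```python
# Sketch only: topic name, message schema and window size are assumptions about
# the CMS stream, not the pilot's actual configuration.
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, from_json, window
from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("cms-online-stats").getOrCreate()

# Hypothetical CMS record: unit id, timestamp and one parametric value.
schema = StructType([
    StructField("unit", StringType()),
    StructField("ts", TimestampType()),
    StructField("value", DoubleType()),
])

stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")
          .option("subscribe", "cms-parametrics")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("r"))
          .select("r.*"))

# Near-real-time operational statistics per unit over 10-minute windows.
stats = (stream
         .withWatermark("ts", "10 minutes")
         .groupBy(window(col("ts"), "10 minutes"), col("unit"))
         .agg(avg("value").alias("mean_value")))

query = stats.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```

In the pilot, the results would be written to the Postgres database rather than to the console used here for illustration.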

4.4 Deployment

Table 6 lists the components provided to the pilot as part of BDI, and components that will be developed within WP6 in the context of executing the pilot.

Table 6: Components needed to deploy the Second SC3 Pilot

Module | Task | Responsible
Spark, HDFS, Postgres, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, SWC
Acoustic Emissions DSP | To be developed for the pilot | CRES
OPC data connector | To be developed for the pilot | CRES
Data visualization | To be extended for the pilot | CRES


5 Second SC4 Pilot Deployment

5.1 Use cases

The pilot is carried out by FhG and CERTH in the frame of SC4: Smart, Green and Integrated Transport.

The pilot demonstrates how to implement the workflow for ingesting, processing and storing stream and historical traffic data in a distributed environment. The pilot demonstrates the following workflows:

- The map matching of the Floating Car Data (FCD) stream that is generated by the taxi fleet. The FCD data, which represents the position of cabs using latitude and longitude coordinates, must be map matched to the roads on which the cabs are driving, in order to infer the traffic conditions of the roads. The map matching is done through an algorithm using a geographical database and topological rules.
- The monitoring of the current traffic conditions, which consumes the mapped FCD data and infers the traffic conditions of the roads.
- The forecasting of future traffic conditions, based on a model that is trained from historical and real-time mapped FCD data.

The second pilot is based upon the processing modules developed in the first pilot (cf. D5.2, Section 5), namely the processing modules developed by CERTH to analyze traffic data and classify traffic conditions. The second pilot will also develop the newly added workflow of traffic forecasting and model training, which did not exist during the first pilot cycle.

The data sources available for the pilot are:

- A near-real-time stream of Floating Car Data (FCD) generated by a fleet of 1200 taxis, containing information about the position, speed and direction of the cabs.
- A historical database of recorded FCD data.
- A geographical database with information about the road network in Thessaloniki.

The results of traffic monitoring and traffic forecasting are saved into a database for querying, statistics and visualizations.

5.2 Requirements

Table 7 lists the ingestion, storage, processing and output requirements set by this pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 7 lists only the new requirements.


Table 7: Requirements of the Second SC4 Pilot

R1. The pilot will enable the evaluation of present and future traffic conditions (e.g. congestion) within temporal windows.
    Comment: The map-matched FCD data is used to determine the current traffic condition and to make predictions within different time windows.

R2. The traffic predictions will be saved in a database.
    Comment: Traffic conditions and predictions will be used for queries, statistics, evaluation of the quality of predictions, and visualizations.

R3. The pilot can be started in two configurations: single node (for development and testing) and cluster (production).
    Comment: It must be possible to run all the pilot components in one single node for development and testing purposes. The cluster configuration must provide clustering of all components: messaging system (Kafka), processing modules (Flink, Spark, TensorFlow), storage (Postgres).


Figure 4: Architecture of the Second SC4 Pilot

5.3 Architecture

The architecture of the pilot has been designed taking into consideration the data sources (mostly streams), the processing steps needed, and the information that needs to be computed. The pilot will ingest data from a near-real-time FCD data stream from cabs and from historical FCD data. The FCD data needs to be preprocessed for map matching before being used for classification/prediction.

Apache Kafka will be used to distribute the computations, as it provides a scalable, fault-tolerant messaging system (a sketch of the FCD producer side follows below). The processing of the data streams will be performed within temporal windows. Apache Flink will be used for the map matching algorithm, in the same manner as in the first cycle of the pilot. Apache Spark or TensorFlow will be considered as a platform to implement the traffic forecasting algorithm.

The algorithms used for the map matching and classification will be provided using R, as it provides good support for machine learning algorithms and because it is commonly used and well known by researchers at CERTH. In order to use the R packages in a Flink application developed in Java, the pilot will connect to an R server (via Rserve). Recurrent Neural Networks will be used for the traffic forecasting module.

The computed traffic conditions and predictions will be stored in a scalable, fault-tolerant database such as Elasticsearch. The storage system must support spatial and temporal indexing.
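The following is a minimal sketch of the Kafka producer side foreseen in Table 8, which pushes FCD records onto a topic for the Flink map-matching job. The web-service URL, record fields, topic name and polling interval are assumptions for illustration, not the pilot's actual interfaces.

```python
# Sketch only: the FCD web-service URL, record fields and topic name are
# assumptions, not the pilot's actual interfaces.
import json
import time

import requests
from kafka import KafkaProducer          # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda rec: json.dumps(rec).encode("utf-8"),
)

def poll_fcd(url="http://fcd-service.example/latest"):
    """Fetch the latest batch of FCD records (position, speed, direction per cab)."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.json()        # assumed: a list of {"cab": ..., "lat": ..., "lon": ..., ...}

if __name__ == "__main__":
    while True:
        for record in poll_fcd():
            producer.send("fcd-raw", record)   # consumed by the Flink map-matching job
        producer.flush()
        time.sleep(30)                          # polling interval is illustrative
```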

5.4 Deployment

Table 8 lists the components provided to the pilot as part of BDI, and components that will be developed within WP6 in the context of executing the pilot.

Table 8: Components needed to deploy the Second SC4 Pilot

Module | Task | Responsible
PostGIS, Elasticsearch, Kafka, Flink, Spark, TensorFlow | BDI dockers made available by WP4 | NCSR-D, SWC, TF, FhG
A Kafka producer for the FCD data stream (source: URL) and historical data (source: file system) | Develop a Kafka producer to collect the FCD data as a stream from web services, and from the file system for the historical data sets, and send them to a Kafka topic | FhG
Kafka brokers | Install Kafka to provide a message broker and the topics | SWC
A Spark application for traffic forecasting and model training | Develop a Spark application that consumes map-matched FCD data from a Kafka topic. The application will train a prediction model and write the traffic predictions to Elasticsearch | FhG
A Kafka consumer for storing analysis results | Develop a Kafka consumer that stores the results of the traffic classification and prediction modules | FhG

6 Second SC5 Pilot Deployment

6.1 Use cases

The pilot is carried out by NCSR-D in the frame of SC5: Climate Action, Environment, Resource Efficiency and Raw Materials.

The pilot demonstrates the following workflow: A (potentially hazardous) substance is released in the atmosphere, resulting in increased readings in one or more monitoring stations. The user accesses a user interface provided by the pilot to define the locations of the monitoring stations, as well as a timeseries of the measured values (e.g. gamma dose rate). The platform initiates:

- a weather matching algorithm, that is, a search for similarity between the current weather and the pre-computed weather patterns, as well as
- a dispersion matching algorithm, that is, a search for similarity between the current substance dispersion patterns and the precomputed ones.

The weather patterns have been extracted in a pre-processing step by clustering weather conditions recorded in the past, while the substance dispersion patterns have been precomputed by simulating different scenarios of substance release and weather conditions. The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request.

The following datasets are involved:

- NetCDF files from the European Centre for Medium range Weather Forecasting (ECMWF, http://apps.ecmwf.int/datasets).
- GRIB files from the National Oceanic and Atmospheric Administration (NOAA, https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/global-forcast-system-gfs).

The following processing will be carried out:

- The weather clustering algorithm, which creates clusters of similar weather conditions, implemented using the BDI platform (see Section 6.3).


- The WRF downscaling, which takes as input a low-resolution weather field and creates a high-resolution one.
- The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model, which computes dispersion patterns given predominant weather conditions.

The following outputs are made available for visualization or further processing:

- The dispersions produced by DIPCOT.
- The weather clusters produced by the weather clustering algorithm.

6.2 Requirements

Table 9 lists the ingestion, storage, processing and output requirements set by this pilot.

Table 9: Requirements of the Second SC5 Pilot

R1. Provide a means of downloading current/evaluation weather from ECMWF or alternative services.
    Comment: A data connector/interface needs to be developed.

R2. ECMWF and NOAA datasets are compatible with the WRF and DIPCOT naming conventions.
    Comment: A preprocessing WPS normalization step will perform the necessary transformations and variable renamings needed to ensure compatibility (a sketch of this step follows the table).

R3. Retrieve NetCDF files from HDFS as input to the weather clustering algorithm.

R4. Dispersion matching will filter on dispersion values.
    Comment: A relational database will provide indexes on dispersion values for efficient dispersion search.

R5. Dispersion visualization.
    Comment: Weather and dispersion matching must produce output compatible with Sextant's input, or Sextant must be modified to support the new input.
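The following is a minimal sketch of the WPS-style normalization step of R2: renaming variables in a NetCDF file so that downstream tools find the names they expect. The variable mapping is an illustrative assumption, not the actual ECMWF-to-WRF/DIPCOT convention.

```python
# Sketch of the WPS-style normalization step: rename variables in a NetCDF file
# so that downstream tools find the names they expect. The mapping below is an
# illustrative assumption, not the actual ECMWF-to-WRF/DIPCOT convention.
from netCDF4 import Dataset              # pip install netCDF4

RENAMES = {
    "t2m": "T2",      # hypothetical: 2-metre temperature
    "u10": "U10",     # hypothetical: 10-metre wind, U component
    "v10": "V10",     # hypothetical: 10-metre wind, V component
}

def normalize(path):
    """Rename known variables in place; leave anything unrecognized untouched."""
    with Dataset(path, mode="r+") as ds:
        for old, new in RENAMES.items():
            if old in ds.variables and new not in ds.variables:
                ds.renameVariable(old, new)

if __name__ == "__main__":
    normalize("ecmwf_forecast.nc")        # hypothetical input file
```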


Figure 5: Architecture of the Second SC5 Pilot

6.3 Architecture

To satisfy the requirements described above, the following components will be deployed.

Storage infrastructure:
- HDFS for storing NetCDF and GRIB files.
- Postgres for storing dispersions.

Processing components:
- scikit-learn or TensorFlow to host the weather clustering algorithm (a sketch follows this list).

Other modules:
- ECMWF and NOAA data connectors.
- WPS normalization procedure.
- WRF downscaling component.
- DIPCOT atmospheric dispersion model.
- Weather and dispersion matching.
- Sextant for visualizing the dispersion layer.
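The following is a minimal sketch of what the weather clustering step could look like with scikit-learn: each historical weather snapshot is flattened into a feature vector and grouped into a small number of weather patterns. The chosen variables, grid handling and number of clusters are assumptions; the pilot's actual clustering algorithm may differ.

```python
# Sketch only: the chosen variables, grid handling and number of clusters are
# assumptions; the pilot's actual clustering algorithm may differ.
import numpy as np
from netCDF4 import Dataset              # pip install netCDF4
from sklearn.cluster import KMeans       # pip install scikit-learn

def load_features(path, variables=("U10", "V10", "T2")):
    """Flatten each time step of the selected fields into one feature vector."""
    with Dataset(path) as ds:
        fields = [np.asarray(ds.variables[v][:]) for v in variables]  # (time, lat, lon)
    return np.concatenate([f.reshape(f.shape[0], -1) for f in fields], axis=1)

def cluster_weather(path, n_clusters=8):
    """Group historical weather snapshots into n_clusters weather patterns."""
    features = load_features(path)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(features)
    return model.cluster_centers_, model.labels_

if __name__ == "__main__":
    centers, labels = cluster_weather("historical_weather.nc")   # hypothetical file
    print(f"{len(centers)} weather patterns over {len(labels)} snapshots")
```

The cluster centres would then play the role of the pre-computed weather patterns that the weather matching algorithm compares against the current conditions.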


6.4 Deployment

Table 10 lists the components provided to the pilot as part of BDI, and components that will be developed within WP6 in the context of executing the pilot.

Table 10: Components needed to deploy the Second SC5 Pilot

Module | Task | Responsible
HDFS, Sextant, Postgres | BDI dockers made available by WP4 | TF, UoA, NCSR-D
Scikit-learn / TensorFlow | To be developed in the pilot | NCSR-D
DIPCOT | To be packaged in the pilot | NCSR-D
Weather clustering algorithm | To be developed in the pilot | NCSR-D
Weather matching | To be developed in the pilot | NCSR-D
Dispersion matching | To be developed in the pilot | NCSR-D
ECMWF and NOAA data connector | To be developed in the pilot | NCSR-D
Data visualization UI | To be developed in the pilot | NCSR-D


7 Second SC6 Pilot Deployment

7.1 Use cases

The pilot is carried out by NCSR-D and SWC in the frame of SC6: Europe in a changing world – Inclusive, innovative and reflective societies.

The pilot demonstrates the following workflow: Municipality economic data (i.e. budget and budget execution data) are ingested on a regular basis (daily, weekly, and so on) from a series of locations, in a variety of structures and formats, and are homogenized so that they can be compared, analyzed and visualized in a comprehensible way. The data is exposed to users via a dashboard that exposes search/discovery, aggregation, analysis, correlation and visualization functionalities over structured data. The results of the data analysis will be stored in the infrastructure to avoid carrying out the same processing multiple times.

The second cycle of the pilot will extend the first pilot by incorporating different formats, by developing a modular parsing library.

The following datasets are involved:

- Budget execution data of the Municipality of Athens.
- Budget execution data of the Municipality of Thessaloniki.
- Budget execution data of the Municipality of Barcelona.

The current datasets involved are exposed either as an API or as CSV/XML files.

Datasets will be described by DCAT-AP metadata (cf. https://joinup.ec.europa.eu/asset/dcat_application_profile/description) and the FIBO (cf. http://www.omg.org/spec/EDMC-FIBO/FND/1.0/Beta1/index.htm) and FIGI (cf. http://www.omg.org/hot-topics/finance.htm) ontologies. Statistical data will be described in the RDF Data Cube vocabulary (cf. https://www.w3.org/TR/2014/REC-vocab-data-cube-20140116/).

The following processing is carried out:

- Data ingestion and homogenization.
- Aggregation, analysis and correlation over the budget data.

The following outputs are made available for visualization or further processing:

- Structured information extracted from budget datasets, exposed as a SPARQL endpoint.
- Metadata for dataset searching and discovery.


- Aggregation and analysis results.

7.2 Requirements

Table 11 lists the ingestion, storage, processing and output requirements set by this pilot.

Table 11: Requirements of the Second SC6 Pilot

R1. In case of failures during data processing, users should be able to recover data produced earlier in the workflow, along with their lineage and associated metadata.
    Comment: Processing modules should periodically store intermediate results. When starting up, processing modules should check at the metadata registry if intermediate results are available.

R2. Transform budget data into a homogenized format using various parsers.
    Comment: Parsers will be developed for the pilot, taking into account R1.

R3. Expose data and metadata through a SPARQL endpoint.
    Comment: The triple store should be accessed via a SPARQL endpoint.

R4. Intuitive, easy-to-use interface for searching and selecting relevant data sources. The use of the user interface should be documented so that users can ease into using it with as little effort as possible.
    Comment: The GraphSearch UI will be used to create visualizations from SPARQL queries.


Figure 6: Architecture of the Second SC6 Pilot

7.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures:
- HDFS for storing ingested datasets.
- 4store for storing homogenized statistical data and dataset metadata.

Processing infrastructures:
- Metadata extraction: Spark is used to extract RDF data and metadata from budget data. These tools will react on Kafka messages.
- PoolParty: A SKOS thesaurus (cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System) will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g. bio terms, locations and other named entities). For this step the SWC PoolParty Semantic Suite (cf. http://www.poolparty.biz) will be used as an external service. PoolParty is accessible from the BDE components via an HTTP API; the connection between Spark and PoolParty has been implemented in the first pilot cycle. Additional enrichment of the dataset will be explored, e.g. via linking to DBpedia or other LOD sources.
- Data analysis that will be performed on demand, by pre-defined queries in the dashboard.

Other modules:
- Flume for dataset ingestion: for every source that will be ingested into the system, there will be a Flume agent responsible for data ingestion and basic modification/unification.
- Kafka: as soon as a new record is available, a Kafka message will be produced. One Kafka consumer stores the raw data into HDFS.
- A set of pre-defined SPARQL queries that carry out analytical aggregations, important comparisons and/or other analysis of the data (a sketch follows this list).
- GUI that provides functionality for (a) metadata searching, to discover datasets, data and publications, and (b) linked data browsing (i.e. dereferencing entity descriptions in RDF), in the form of a visual dashboard realised in d3.js (cf. https://d3js.org).
- GraphSearch as the user interface.
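The following is a minimal sketch of one of the pre-defined aggregation queries being run against the 4store SPARQL endpoint. The endpoint URL and the RDF Data Cube measure URI are illustrative assumptions, not the pilot's actual data model.

```python
# Sketch only: the endpoint URL and the Data Cube measure URI are illustrative
# assumptions, not the pilot's actual data model.
from SPARQLWrapper import SPARQLWrapper, JSON   # pip install SPARQLWrapper

QUERY = """
PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT ?dataset (SUM(?amount) AS ?total)
WHERE {
  ?obs a qb:Observation ;
       qb:dataSet ?dataset ;
       <http://example.org/measure/amount> ?amount .   # hypothetical measure
}
GROUP BY ?dataset
"""

def run_predefined_query(endpoint="http://4store:8080/sparql/"):
    """Run one of the dashboard's pre-defined aggregation queries and return rows."""
    client = SPARQLWrapper(endpoint)
    client.setQuery(QUERY)
    client.setReturnFormat(JSON)
    results = client.query().convert()
    return [(r["dataset"]["value"], r["total"]["value"])
            for r in results["results"]["bindings"]]

if __name__ == "__main__":
    for dataset, total in run_predefined_query():
        print(dataset, total)
```

In the pilot, queries of this kind would be triggered on demand from the dashboard and their results rendered in the d3.js visualizations.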

7.4 Deployment

Table 12 lists the components provided to the pilot as part of BDI, and components that will be developed within WP6 in the context of executing the pilot.

Table 12: Components needed to deploy the Second SC6 Pilot

Module | Task | Responsible
Spark over HDFS, 4store, Flume, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, SWC
Data storage schema | To be extended for the pilot | SWC
Metadata extraction | Parsers for different data sources will be developed for the pilot | SWC
GraphSearch GUI | To be configured for the pilot | SWC

8 Second SC7 Pilot Deployment

8.1 Use cases

The pilot is carried out by SatCen, UoA and NCSR-D in the frame of SC7: Secure societies – Protecting freedom and security of Europe and its citizens.

The pilot demonstrates the following workflows:

1. Event detection workflow: News sites and social media are monitored and processed in order to extract and localize information about events. Events are categorized and the information from them is extracted; the end user is notified about the area affected by the news and can visualize the event information together with the changes detected by the other workflow (if activated).
2. Change detection workflow: The end user selects a relevant Area of Interest. With respect to the selected dates, two satellite images (earliest and latest) of these areas are downloaded from the ESA Sentinels Scientific Data Hub and processed in order to detect changes. The end user is notified about detected changes and can view the images and event information about this area.

The second cycle of the SC7 pilot will extend the functionality and improve the performance of the first cycle of the pilot (cf. D5.2, Section 8).

Apart from the datasets used in the first cycle of the pilot, this cycle will also use the keyword-based Twitter API to retrieve tweets based on pre-defined keywords. To further support the keyword-based search, the second cycle of the pilot will also include a full-text indexing engine.

The following outputs are made available for visualization or further processing:

- Relevant news related to specific keywords, together with the corresponding Area of Interest.
- Detected changes.

Moreover, the event detection workflow will be extended in order to automatically activate the change detection workflow. These changes are depicted in the updated architecture diagram in Figure 7.


8.2 Requirements

Table 13 lists the ingestion, storage, processing and output requirements set by the second cycle of the pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 13 lists only the new requirements.

Table 13: Requirements of the Second SC7 Pilot

R1. Monitor keyword-based text services (Twitter). Text is retrieved and stored together with provenance and any metadata provided by the service (notably location).
    Comment: The NOMAD data connectors to Twitter and Reuters will be adapted to access the keyword search API of Twitter and store to Cassandra.

R2. Regularly execute event detection using Spark over the most recent text batch.
    Comment: Event detection is part of the ingestion process and adds annotations to the text data; it is not part of the distributed processing.

R3. Improve the speed of the change detection workflow.
    Comment: Optimize the scalability of the operators developed in Apache Spark for the change detection workflow.

R4. Extend the change detection workflow to improve accuracy.
    Comment: Fundamental SNAP operators (e.g. Subset and Terrain Correction) for Sentinel-1 will be adapted to Apache Spark.

R5. Areas of Interest are automatically defined by event detection.
    Comment: The Sentinel data connector is parametrized from the event detection module with a GIS shape.

R6. The end-user interface is based on Sextant.
    Comment: Improvement of Sextant functionalities to improve the user experience.

R7. Users must be authenticated and authorized to access the pilot data.
    Comment: Sextant will be extended in order to support authentication and authorization.

Figure 7: Architecture of the Second SC7 Pilot

8.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures:
- HDFS for storing satellite images.
- Cassandra for storing news and tweet content and metadata.
- Lucene for storing the GADM dataset, i.e. the administrative areas together with their geo-locations.
- Strabon for storing geo-locations of detected changes and location metadata about news and tweets.

Processing infrastructures:
- Spark will be made available for improving the change detection module and developing the event detection module.

Data integration:
- Semagrow will federate Strabon and Cassandra to provide the user interface with homogeneous access to both data stores.

Other modules:
- Twitter data connector (a sketch of its storage side follows this list).
- Reuters RSS feed reader.
- The Sentinel Data Aggregator, which receives as input the set of areas of interest and submits a suitable query to the Sentinels Scientific Data Hub.
- Sextant as the user interface.
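The following is a minimal sketch of how the Twitter data connector could store keyword-matched tweets, together with provenance and location metadata, into Cassandra (cf. R1). The keyspace, table and column names are assumptions about the pilot's schema; retrieving tweets from the Twitter keyword-search API is not shown.

```python
# Sketch only: keyspace, table and column names are assumptions about the pilot's
# Cassandra schema; retrieving tweets from the Twitter keyword-search API is not
# shown here.
import uuid
from datetime import datetime, timezone

from cassandra.cluster import Cluster    # pip install cassandra-driver

cluster = Cluster(["cassandra"])
session = cluster.connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS sc7
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS sc7.tweets (
        id uuid PRIMARY KEY, keyword text, body text,
        source text, fetched_at timestamp, latitude double, longitude double)
""")
insert = session.prepare(
    "INSERT INTO sc7.tweets (id, keyword, body, source, fetched_at, latitude, longitude) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)")

def store_tweet(keyword, body, lat=None, lon=None):
    """Store one keyword-matched tweet with provenance and (optional) location."""
    session.execute(insert, (uuid.uuid4(), keyword, body, "twitter",
                             datetime.now(timezone.utc), lat, lon))

if __name__ == "__main__":
    store_tweet("flood", "Example tweet text mentioning a flood", 40.63, 22.94)
```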

8.4 Deployment

Table 14 lists the components provided to the pilot as part of BDI (cf. https://github.com/big-data-europe/README/wiki/Components) and components that will be developed within WP6 in the context of executing the pilot.

Table 14: Components needed to deploy the Second SC7 Pilot

Module | Task | Responsible
Big Data Integrator: HDFS/Hadoop, Cassandra, Spark, Semagrow, Strabon, SOLR | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, UoA, SWC
Cassandra and Strabon stores | The schema needs to be altered to support tweets by keyword | NCSR-D and UoA
Change detection module | Spark code to be developed for extending and improving the change detection algorithm | UoA
Event detection module | Spark code to be developed to scale the event detection algorithm | NCSR-D
Twitter data connector | To be extended to access the keyword search Twitter API | NCSR-D
User interface | To be enhanced for the pilot | UoA


9 Conclusions

This report analysed the pilot requirements and specified the components of the generic Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting round. The relevant work in this task is to ensure that the components are within the scope of what is prepared in WP4, and that they interoperate and can be used in the same application.

All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure and provided to the piloting partners as a basis for their piloting applications, which will be developed in WP6. As a result of this preliminary testing and the interaction between the technical partners and the piloting partners, some of the original pilot descriptions have been refined and fully specified, and their usage of BDI components has been clarified. This ensures that the pilot descriptions are consistent with the first public release of the BDI platform (D4.2) and can be reproduced by interested third parties.

Work in this task (Task 5.2) will proceed as follows:

- During the second pilot deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.
- During the third pilot deployment phase, work in this task will prepare the next version of this document regarding the BDI instances needed for the third piloting round.

Page 2: Data Technologies · consulting about the platform. This task includes two phases: the design and the deployment phase. The design phase involves the following. Review the pilot descriptions

D54 ndash v 100

Page

2

History

Version Date Reason Revised by

000 21012016 Document structure S Konstantopoulos

001 2122016 First draft based on descriptions of

second piloting cycle

A Charalambidis

S Konstantopoulos G

Stavrinos and I

Klampanos

002 1212017 Second draft based on the

feedback of the piloting partners

A Charalambidis

S Konstantopoulos

and V Karkaletsis

003 1212017 Corrections and comments for SC2

pilot

P Karampiperis

and P Zervas

004 1312017 Corrections and comments for SC7

pilot

G Papadakis

M Lazarrini

005 2212017 Corrections for SC1 pilot B Williams-Jones

K McNeice

006 2512017 SC3 pilot architecture A Charalambidis and I

Mouchakis

007 122017 Corrections and clarifications for

SC3

A Charalambidis and F

Mouzakis

008 922017 Peer review A Versteden and H

Jabeen

009 1622017 Address peer review comments

A Charalambidis

S Konstantopoulos

I Klampanos

J Jakobitsch and

K McNeice

100 2322017 Final version

D54 ndash v 100

Page

3

Author List

Organisation Name Contact Information

NCSR-D Ioannis Mouchakis gmouchakisiitdemokritosgr

NCSR-D Stasinos Konstantopoulos konstantiitdemokritosgr

NCSR-D Angelos Charalambidis acharaliitdemokritosgr

NCSR-D Georgios Stavrinos gstavrinosiitdemokritosgr

NCSR-D Vangelis Karkaletsis vangelisiitdemokritosgr

Agroknow Pythagoras Karampiperis pythkagroknowcom

Agroknow Panagiotis Zervas pzervasagroknowcom

Open

PHACTS Bryn Williams-Jones brynopenphactsfoundationorg

Open

PHACTS Kiera McNeice kieraopenphactsfoundationorg

CRES Fragkiskos Mouzakis mouzakiscresgr

TenForce Aad Versteden aadverstedentenforcecom

UoB Hajira Jabeen hajirajabeengmailcom

SWC Juumlrgen Jakobitsch jjakobitschsemantic-webat

D54 ndash v 100

Page

4

Executive Summary

This report documents the instantiations of the Big Data Integrator Platform that underlies the

pilot applications that will be prepared in WP6 for serving exemplary use cases of the Horizon

2020 Societal Challenges These platform instances will be provided to the relevant networking

partners to be used for executing the pilot sessions foreseen in WP6

For each of the seven pilots this document provides (a) a brief summary of the pilot description

prepared in WP6 and especially of the use cases provided in the pilot descriptions (b) the

technical requirements for carrying out these use cases (c) an architecture that shows the BDI

components required to cover these requirements and (d) the list of components in the

architecture and their status (available as part of BDI or otherwise available or to be

developed as part of the pilot)

D54 ndash v 100

Page

5

Abbreviations and Acronyms

BDI | The Big Data Integrator platform that is developed within Big Data Europe. The components that are made available to the pilots by BDI are listed here: https://github.com/big-data-europe/README/wiki/Components
BDI Instance | A specific deployment of BDI, complemented by tools specifically supporting a given Big Data Europe pilot
BT | Bluetooth
ECMWF | European Centre for Medium-range Weather Forecasting
ESGF | Earth System Grid Federation
FCD | Floating Car Data
LOD | Linked Open Data
SC1 | Societal Challenge 1: Health, Demographic Change and Wellbeing
SC2 | Societal Challenge 2: Food Security, Sustainable Agriculture and Forestry, Marine, Maritime and Inland Water Research and the Bioeconomy
SC3 | Societal Challenge 3: Secure, Clean and Efficient Energy
SC4 | Societal Challenge 4: Smart, Green and Integrated Transport
SC5 | Societal Challenge 5: Climate Action, Environment, Resource Efficiency and Raw Materials
SC6 | Societal Challenge 6: Europe in a changing world – Inclusive, innovative and reflective societies
SC7 | Societal Challenge 7: Secure societies – Protecting freedom and security of Europe and its citizens
AK | Agroknow, Belgium
CERTH | Centre for Research and Technology, Greece
CRES | Center for Renewable Energy Sources and Saving, Greece
FAO | Food and Agriculture Organization of the United Nations, Italy
FhG | Fraunhofer IAIS, Germany
InfAI | Institute for Applied Informatics, Germany
NCSR-D | National Center for Scientific Research "Demokritos", Greece
OPF | Open PHACTS Foundation, UK
SWC | Semantic Web Company, Austria
UoA | National and Kapodistrian University of Athens
VU | Vrije Universiteit Amsterdam, the Netherlands


Table of Contents

1 Introduction
  1.1 Purpose and Scope
  1.2 Methodology
2 Second SC1 Pilot Deployment
  2.1 Use Cases
  2.2 Requirements
  2.3 Architecture
  2.4 Deployment
3 Second SC2 Pilot Deployment
  3.1 Overview
  3.2 Requirements
  3.3 Architecture
  3.4 Deployment
4 Second SC3 Pilot Deployment
  4.1 Overview
  4.2 Requirements
  4.3 Architecture
  4.4 Deployment
5 Second SC4 Pilot Deployment
  5.1 Use cases
  5.2 Requirements
  5.3 Architecture
  5.4 Deployment
6 Second SC5 Pilot Deployment
  6.1 Use cases
  6.2 Requirements
  6.3 Architecture
  6.4 Deployment
7 Second SC6 Pilot Deployment
  7.1 Use cases
  7.2 Requirements
  7.3 Architecture
  7.4 Deployment
8 Second SC7 Pilot Deployment
  8.1 Use cases
  8.2 Requirements
  8.3 Architecture
  8.4 Deployment
9 Conclusions

List of Tables

Table 1: Requirements of the Second SC1 Pilot
Table 2: Components needed to deploy the Second SC1 Pilot
Table 3: Requirements of the Second SC2 Pilot
Table 4: Components needed to deploy the Second SC2 Pilot
Table 5: Requirements of the Second SC3 Pilot
Table 6: Components needed to deploy the Second SC3 Pilot
Table 7: Requirements of the Second SC4 Pilot
Table 8: Components needed to deploy the Second SC4 Pilot
Table 9: Requirements of the Second SC5 Pilot
Table 10: Components needed to deploy the Second SC5 Pilot
Table 11: Requirements of the Second SC6 Pilot
Table 12: Components needed to deploy the Second SC6 Pilot
Table 13: Requirements of the Second SC7 Pilot
Table 14: Components needed to deploy the Second SC7 Pilot


List of Figures

Figure 1: Architecture of the Second SC1 Pilot
Figure 2: Architecture of the Second SC2 Pilot
Figure 3: Architecture of the Second SC3 Pilot
Figure 4: Architecture of the Second SC4 Pilot
Figure 5: Architecture of the Second SC5 Pilot
Figure 6: Architecture of the Second SC6 Pilot
Figure 7: Architecture of the Second SC7 Pilot


1 Introduction

1.1 Purpose and Scope

This report documents the instantiations of the Big Data Integrator Platform (BDI) for serving the needs of the domains examined within Big Data Europe. These platform instances will be provided to the relevant networking partners to execute the pilots foreseen in WP6.

1.2 Methodology

Task 5.2 focuses on the application of the generic instantiation methodology in a specific use case pertaining to domains closely related to Europe's Societal Challenges. To this end, T5.2 comprises seven (7) distinct sub-tasks, each one dedicated to a different domain of application.

Participating partners and their roles: NCSR-D (task leader) deploys the different instantiations of the Big Data Integrator Platform and supports the partners carrying out each pilot with consulting about the platform. This task includes two phases: the design phase and the deployment phase. The design phase involves the following:

- Review the pilot descriptions prepared in WP6 and request clarifications where needed, in order to prepare a detailed technical description of the platform that will support the pilot.
- Prepare a first draft of the sections for the second-cycle pilots, where the use cases and workflows from the pilot descriptions are summarized and the technical requirements and an architecture for each pilot-specific platform are drafted.
- Cooperate with the persons responsible for each pilot to update the pilot description and the technical description in this deliverable, so that they are consistent and satisfactory. This draft also includes a list of components and their availability: (a) base platform components that are prepared in WP4, (b) pilot-specific components that are already available, or (c) pilot-specific components that will be developed for the pilot. Components are also assigned a partner responsible for their implementation.
- Review the pilot technical descriptions from the perspective of bridging between technical work and the community requirements, to establish that the pilot is relevant to the communities it is aimed at.

During the deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.


2 Second SC1 Pilot Deployment

2.1 Use Cases

The pilot is carried out by OPF and VU in the frame of SC1: Health, Demographic Change and Wellbeing.

The pilot demonstrates the workflow of reproducing the functionality of an existing data integration and processing system (the Open PHACTS Discovery Platform) on BDI. The second pilot extends the first pilot (cf. D5.2, Section 2) with the following:

- Discussions with stakeholders and other Societal Challenges will identify how the existing Open PHACTS platform and datasets may potentially be used to answer queries in other domains. In particular, applications in Societal Challenge 2 (food security and sustainable agriculture), where the effects of chemistry (e.g. pesticides) on biology are probed in plants, could exploit the linked data services currently within the OPF platform. This will require discussing use case specifics with SC2 to understand their requirements and ensure that the OPF data is applicable. Similarly, we will explore whether SC2 data that is relevant for early biology research could be linked to the OPF data platform.
- No specific new datasets are targeted for integration in the second pilot. However, if datasets to be made available through other pilots have clear potential links to Open PHACTS datasets, these will be considered for integration into the platform, to offer researchers the ability to pose more complex queries across a wider range of data.
- The second pilot will aim to expand on the first pilot by refreshing the datasets integrated into the pilot. Homogenising and integrating the new data available for these datasets, and developing ways to update datasets by integrating new data on an ongoing basis, will enable new use cases where researchers require fully current datasets for their queries.
- The second pilot will also simplify existing workflows for querying the API, for example with components for common software tools such as KNIME, reducing the barrier for academic institutions and companies to access the platform for knowledge- and data-driven biomedical research use cases.

2.2 Requirements

Table 1 lists the ingestion, storage, processing and output requirements set by this pilot.

Table 1: Requirements of the Second SC1 Pilot

R1. The solution should be packaged in a way such that it is possible to combine the Open PHACTS Docker and the BDE platform to achieve a custom integrated solution.
Comment: Specificities of the services of the Open PHACTS Discovery Platform should not be hard-wired into the domain-specific instance, but should be read from a configuration file (such as a Swagger description). The BDE instance should offer or apply these external services over data hosted by the BDE instance.

R2. RDF data storage.
Comment: The current Open PHACTS Discovery Platform is based on distributed Virtuoso, a proprietary solution. The BDE platform will provide a distributed 4store and SANSA, to be compared with the Open PHACTS Discovery Platform.

R3. Datasets are aligned and linked at data ingestion time, and the transformed data is stored.
Comment: In conjunction with R1, a modular data ingestion component should dynamically decide which data transformers to invoke.

R4. Data and query security and privacy requirements.
Comment: A BDI local deployment holds private data and serves private queries. BDE does not foresee any specific technical support for query obfuscation, so remote data sources need to be cloned locally to guarantee query privacy.

Figure 1: Architecture of the Second SC1 Pilot

2.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures:
- Distributed triple store for the data. The second pilot cycle will also test the feasibility of using the SANSA stack (http://sansa-stack.net) as an alternative for SPARQL query processing.

Processing infrastructures:
- Scientific Lenses query expansion.

Other modules:
- Data connector, including the data transformation modules for the alignment of data at ingestion time.
- REST API for querying, which builds a SPARQL query by using keywords to fill in pre-defined query templates. The querying service also uses Scientific Lenses to expand queries.
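To make the querying service more concrete, the following is a minimal sketch of a keyword-driven REST endpoint that fills a pre-defined SPARQL template and forwards it to the instance's triple store. The endpoint URL, the template, the property IRIs and the parameter names are illustrative assumptions and not the pilot's actual API:

    # Illustrative sketch only: a keyword-driven query service in the spirit of
    # the pilot's REST API. Endpoint URL, template and names are assumptions.
    import requests
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    SPARQL_ENDPOINT = "http://4store:8080/sparql/"   # assumed BDI 4store endpoint

    # Pre-defined query template; the keyword is injected into a fixed slot.
    COMPOUND_TARGETS = """
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?target ?label WHERE {{
      ?compound rdfs:label ?clabel .
      FILTER(CONTAINS(LCASE(?clabel), LCASE("{keyword}")))
      ?compound <http://example.org/interactsWith> ?target .
      ?target rdfs:label ?label .
    }} LIMIT 100
    """

    @app.route("/compound/targets")
    def compound_targets():
        keyword = request.args.get("q", "")
        query = COMPOUND_TARGETS.format(keyword=keyword.replace('"', ''))
        resp = requests.get(SPARQL_ENDPOINT,
                            params={"query": query},
                            headers={"Accept": "application/sparql-results+json"})
        resp.raise_for_status()
        return jsonify(resp.json())

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000)

A request such as GET /compound/targets?q=aspirin would then return the JSON bindings produced by the SPARQL endpoint, which a tool like KNIME could consume directly.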

2.4 Deployment

Table 2 lists the components provided to the pilot as part of BDI and the components that will be developed within WP6 in the context of executing the pilot.

Table 2: Components needed to deploy the Second SC1 Pilot

Module | Task | Responsible
4store | BDI dockers made available by WP4 | NCSR-D
SANSA stack | BDI dockers made available by WP4 | FhG/UniBonn
Data connector and transformation modules | Develop a dynamic transformation engine that uses Swagger descriptions to select the appropriate transformer | VU
Query endpoint | Develop a dynamic query re-write engine that uses Swagger descriptions to select the transformer | VU
Scientific Lenses query expansion module | Needs to be deployed and tested, unless an existing live service will be used for the BDE pilot | VU
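The Swagger-driven selection of transformers could, for instance, take the shape of the sketch below. The file names, media types and transformer functions are hypothetical placeholders, not the VU implementation:

    # Illustrative sketch: choose a data transformer from a Swagger/OpenAPI
    # description instead of hard-wiring it. All names below are hypothetical.
    import json

    def rdfize_csv(path):      # placeholder transformer
        print("transforming CSV to RDF:", path)

    def rdfize_json(path):     # placeholder transformer
        print("transforming JSON to RDF:", path)

    # Registry mapping media types declared in the Swagger file to transformers.
    TRANSFORMERS = {
        "text/csv": rdfize_csv,
        "application/json": rdfize_json,
    }

    def select_transformers(swagger_path):
        """Read a Swagger 2.0 description and yield one transformer per media
        type that the described service produces."""
        with open(swagger_path) as f:
            swagger = json.load(f)
        for media_type in swagger.get("produces", []):
            transformer = TRANSFORMERS.get(media_type)
            if transformer is not None:
                yield media_type, transformer

    if __name__ == "__main__":
        # "openphacts-service.json" is an assumed local copy of a service description.
        for media_type, transform in select_transformers("openphacts-service.json"):
            transform("/data/incoming/latest." + media_type.split("/")[-1])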


3 Second SC2 Pilot Deployment

3.1 Overview

The pilot is carried out by AK, FAO and SWC in the frame of SC2: Food Security, Sustainable Agriculture and Forestry, Marine, Maritime and Inland Water Research and the Bioeconomy.

The second pilot cycle builds upon the first pilot cycle (cf. D5.1, Section 3), expanding the relevant data sources and extending the data processing needed to handle a variety of data types (apart from bibliographic data) relevant to viticulture.

The pilot demonstrates the following workflows:

1. Text mining workflow. Automatically annotating scientific publications by (a) extracting named entities (locations, domain terms) and (b) extracting the captions of images, figures and tables. The extracted information is provided to viticultural researchers via a GUI that exposes search functionality.
2. Data processing workflow. The end users (viticultural researchers) upload scientific data in a variety of formats and provide the metadata needed in order to correctly interpret the data. The data is ingested and homogenized, so that it can be compared and connected with other relevant data originally in diverse formats. The data is exposed to viticultural researchers via a GUI that exposes search/discovery, aggregation, analysis, correlation and visualization functionalities over structured data. The results of the data analysis will be stored in the infrastructure, to avoid carrying out the same processing multiple times, with appropriate provenance for future reference, publication and scientific replication.
3. Phenological modelling workflow, that is, the scheduling of agricultural operations (e.g. pruning, harvesting, etc.) by cross-examining the weather data observed in the area of the vineyard with the appropriate weather conditions needed for the aforementioned operations.
4. Variety identification workflow. The end users complete an on-spot questionnaire regarding the characteristics of a specific grape variety. Together with the geolocation of the questionnaire, this information is used to identify a grape variety.
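As an illustration of step (b) of the text mining workflow, caption extraction from publication PDFs could start from something as simple as the sketch below. It assumes the pdfminer.six library and a simplified caption pattern; the pilot's actual extractor, which also associates captions with the images and tables themselves, is more elaborate:

    # Illustrative sketch of workflow 1(b): pull figure/table captions out of a
    # publication PDF. Assumes pdfminer.six; the caption pattern is a simplification.
    import re
    from pdfminer.high_level import extract_text

    CAPTION = re.compile(r"^(Figure|Table)\s+\d+[.:]\s+(.+)$", re.IGNORECASE)

    def extract_captions(pdf_path):
        """Return (kind, caption) pairs found at the start of a line."""
        text = extract_text(pdf_path)
        captions = []
        for line in text.splitlines():
            match = CAPTION.match(line.strip())
            if match:
                captions.append((match.group(1).lower(), match.group(2).strip()))
        return captions

    if __name__ == "__main__":
        # "vine-trial-2016.pdf" is a hypothetical input publication.
        for kind, caption in extract_captions("vine-trial-2016.pdf"):
            print(kind, "->", caption)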

The following datasets will be involved:
- The AGRIS and PubMed datasets, which include scientific publications.
- Weather data available via publicly available APIs, such as AccuWeather, OpenWeatherMap and Weather Underground.
- User-generated data, such as geotagged photos of leaves, young shoots and grape clusters, ampelographic data, and SSR-marker data that will be provided by the VITIS application.
- The OIV Descriptor List for Grape Varieties and Vitis species (http://www.oiv.int/en).
- The Crop Ontology.

The following processing is carried out:
- Named entity extraction
- Researcher affiliation extraction and verification
- Variety identification
- Phenological modelling
- PDF structure processing, to associate tables and diagrams with captions
- Data ingestion and homogenization
- Aggregation, analysis and correlation over scientific data

The following outputs are made available for visualization or further processing:
- Structured information and topics extracted from scientific publications
- Metadata for dataset searching and discovery
- Aggregation, analysis and correlation results

3.2 Requirements

Table 3 lists the ingestion, storage, processing and output requirements set by this pilot.

Table 3: Requirements of the Second SC2 Pilot

R1. In case of failures during data processing, users should be able to recover data produced earlier in the workflow, along with their lineage and associated metadata.
Comment: Processing modules should periodically store intermediate results and their lineage metadata. When starting up, processing modules should check at the metadata registry whether intermediate results are available.

R2. Extracting images and their captions from scientific publications.
Comment: To be developed for the pilot, taking into account R1.

R3. Extracting thematic annotations from text in scientific publications.
Comment: To be developed for the pilot, taking into account R1.

R4. Extracting researcher affiliations from the scientific publications.
Comment: To be developed for the pilot, taking into account R1.

R5. Variety identification.
Comment: To be developed for the pilot, taking into account R1.

R6. Phenological modelling.
Comment: To be developed for the pilot, taking into account R1.

R7. Expose data and metadata in JSON through a Web API.
Comment: The data ingestion module should write JSON documents in HDFS; 4store should be accessed via a SPARQL endpoint that responds with results in JSON.

Figure 2: Architecture of the Second SC2 Pilot

3.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures:
- HDFS for storing publication full-text and ingested datasets.
- A graph database for storing publication metadata (terms and named entities), affiliation metadata (connections between researchers), weather metadata and VITIS metadata.

Processing infrastructures:
- Metadata extraction: Spark or UnifiedViews (cf. http://www.unifiedviews.eu) are used to extract RDF metadata from publication full-text. These tools will react on Kafka messages. Spark and UnifiedViews will be evaluated for this task.
- PoolParty: a SKOS thesaurus (cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System) will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g. bio terms, locations and other named entities). For this step, the SWC PoolParty Semantic Suite (cf. http://www.poolparty.biz) will be used. Additional enrichment of the dataset will be explored, e.g. via linking to DBpedia or other LOD sources.
- AKSTEM: the process of discovering relations and associations between organizations and people in the field of viticulture research.
- Phenological modelling: an algorithm already developed in AK VITIS, to be adapted to work in the context of an Apache Spark application.
- Variety identification: already developed in AK VITIS, to be adapted to work in the context of an Apache Spark application.
- Extraction of images and figures and their captions from publication PDFs.
- Data analysis, which writes analysis results back into the infrastructure to be retrieved for visualization. Data analysis should accompany each write-back with appropriate metadata that specifies the processing lineage of the derived dataset. Intermediate results should also be written out (and described as such in the metadata), in order to allow resuming processing after a failure.

Other modules:
- Flume for publication ingestion: for every source that will be ingested into the system, there will be a Flume agent responsible for data ingestion and basic modification/unification.
- Kafka: as soon as a new record is available, a Kafka message will be produced. One Kafka consumer stores the raw data into HDFS.
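The Kafka consumer that archives raw records into HDFS could be as small as the following sketch. The topic name, HDFS paths and WebHDFS address are assumptions, and the kafka-python and hdfs (WebHDFS) client libraries are one possible choice among several:

    # Illustrative sketch of the Kafka consumer that archives raw records to HDFS.
    # Topic, paths and the WebHDFS address are assumptions, not the pilot's config.
    import json
    import uuid
    from hdfs import InsecureClient
    from kafka import KafkaConsumer

    consumer = KafkaConsumer("publications-raw",              # assumed topic
                             bootstrap_servers="kafka:9092",
                             value_deserializer=lambda v: json.loads(v.decode("utf-8")))
    hdfs_client = InsecureClient("http://namenode:50070", user="flume")  # assumed

    for record in consumer:
        doc = record.value
        # One JSON document per ingested record, keyed by source and a random id.
        path = "/data/raw/{}/{}.json".format(doc.get("source", "unknown"), uuid.uuid4())
        hdfs_client.write(path, data=json.dumps(doc), encoding="utf-8", overwrite=True)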

3.4 Deployment

Table 4 lists the components provided to the pilot as part of BDI and the components that will be developed within WP6 in the context of executing the pilot.

Table 4: Components needed to deploy the Second SC2 Pilot

Module | Task | Responsible
Spark over HDFS, Flume, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, SWC
GraphDB and/or Neo4j dockerization | To be investigated whether the Docker images provided by the official systems (https://neo4j.com/developer/docker) are suitable for the pilot. If not, they will be altered for the pilot, or an already dockerized triple store, such as Virtuoso or 4store, will be used. | SWC
Flume agents for publication ingestion and processing | To be developed for the pilot | SWC
Flume agents for data ingestion | To be extended for the pilot in order to support the introduced datasets (AccuWeather data, user-generated data) | SWC, AK
Data storage schema | To be developed for the pilot | SWC, AK
Phenological modelling | To be adapted from AK VITIS for the pilot | AK
Spark AKSTEM | To be adapted from AK STEM for the pilot | AK
Variety identification | To be adapted from AK VITIS for the pilot | AK


4 Second SC3 Pilot Deployment

4.1 Overview

The pilot is carried out by CRES in the frame of SC3: Secure, Clean and Efficient Energy.

The second pilot cycle extends the first pilot by adding additional online and offline data analysis on raw data regarding Acoustic Emissions (AE) sensors and on aggregated data, such as parametrics from continuous monitoring systems (CMS). The pilot demonstrates the following workflow: a developer in the field of wind energy enhances condition monitoring for each unit in a wind farm by pooling together data from multiple units of the same farm (to consider the cluster operation in total) and third-party data (to perform correlated assessment). The custom analysis modules created by the developer use both raw data, which are transferred offline to the processing cluster, and condensed data, which are streamed online in the same time order as the events occur.

The following datasets are involved:
- Raw sensor and SCADA data from a given wind farm.
- Online stream data comprised of parametrics and statistics extracted from the raw SCADA data.
- Raw sensor data from the Acoustic Emissions module of a given wind farm.

All data is in custom binary or ASCII formats. ASCII files contain a metadata header and, in tabulated form, the signal data (signals in columns, time sequence in rows). All data is annotated by location, time and system id.

The following processing is carried out:
- Near-real-time execution of parametrized models to return operational statistics and warnings, including correlation analysis of data across units.
- Weekly execution of operational statistics.
- Weekly execution of model parametrization.
- Weekly execution of specific acoustic emissions DSP.

The following outputs are made available for visualization or further processing:
- Operational statistics, near-real-time and weekly.
- Model parameters.


4.2 Requirements

Table 5 lists the ingestion, storage, processing and output requirements set by this pilot. Since the second cycle of the pilot extends the first pilot, some requirements are identical and are therefore omitted from Table 5.

Table 5: Requirements of the Second SC3 Pilot

R1. The online data will be sent (via OPC) from the intermediate (local) processing level to BDI.
Comment: A data connector must be developed that provides for receiving OPC streams from an OPC-compatible server.

R2. The application should be able to recover from short outages by collecting the data transmitted during the outage from the data sources.
Comment: An OPC data connector must be developed that can retrieve the missing data collected at the intermediate level from the distributed data historian systems.

R3. Near-real-time execution of parametrized models to return operational statistics, including correlation analysis of data across units.
Comment: The analysis software should write its results back in a specified format and data model that is appropriate input for further analysis.

R4. The GUI supports database querying and data visualization for the analytics results.
Comment: The GUI will be able to access files in the specified format and data model.

Figure 3: Architecture of the Second SC3 Pilot

4.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures:
- HDFS, which stores binary blobs, each holding a temporal slice of the complete data. The slicing parameters are fixed and can be applied at data ingestion time.
- A Postgres relational database to store the warnings, operational statistics and the output of the analysis. The schema will be defined at a later stage.
- A Kafka broker that will distribute the continuous stream of CMS data to model execution.

Processing infrastructures:
- A processor that operates upon temporal slices of data.
- A Spark module that orchestrates the application of the processor on slices.
- A Spark streaming module that operates on the online data.

Other modules:
- A data connector that offers an ingestion endpoint and/or can retrieve from remote data sources using the FTP protocol.
- A data connector that offers an ingestion endpoint that can retrieve an online stream using the OPC protocol and publish it to a Kafka topic.
- Data visualization that can visualize the data files stored in HDFS.
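As an indication of how the Spark streaming module could consume the CMS stream from Kafka, the sketch below computes simple windowed statistics per unit and parametric. The topic, field names, window length and output path are assumptions; the actual parametrized models are considerably richer:

    # Illustrative sketch of the Spark streaming module over the online CMS stream.
    # Topic, schema, window length and output path are assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import avg, col, from_json, stddev, window
    from pyspark.sql.types import DoubleType, StringType, StructType, TimestampType

    spark = SparkSession.builder.appName("sc3-cms-statistics").getOrCreate()

    schema = (StructType()
              .add("unit_id", StringType())
              .add("timestamp", TimestampType())
              .add("parametric", StringType())
              .add("value", DoubleType()))

    cms = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "kafka:9092")
           .option("subscribe", "cms-parametrics")          # assumed topic
           .load()
           .select(from_json(col("value").cast("string"), schema).alias("m"))
           .select("m.*"))

    stats = (cms.withWatermark("timestamp", "10 minutes")
             .groupBy(window(col("timestamp"), "10 minutes"),
                      col("unit_id"), col("parametric"))
             .agg(avg("value").alias("mean"), stddev("value").alias("std")))

    query = (stats.writeStream.outputMode("append")
             .format("parquet")
             .option("path", "/data/cms/statistics")        # assumed HDFS path
             .option("checkpointLocation", "/data/cms/checkpoints")
             .start())
    query.awaitTermination()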

4.4 Deployment

Table 6 lists the components provided to the pilot as part of BDI and the components that will be developed within WP6 in the context of executing the pilot.

Table 6: Components needed to deploy the Second SC3 Pilot

Module | Task | Responsible
Spark, HDFS, Postgres, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, SWC
Acoustic Emissions DSP | To be developed for the pilot | CRES
OPC data connector | To be developed for the pilot | CRES
Data visualization | To be extended for the pilot | CRES


5 Second SC4 Pilot Deployment

5.1 Use cases

The pilot is carried out by FhG and CERTH in the frame of SC4: Smart, Green and Integrated Transport.

The pilot demonstrates how to implement the workflow for ingesting, processing and storing stream and historical traffic data in a distributed environment. The pilot demonstrates the following workflows:

- The map matching of the Floating Car Data (FCD) stream that is generated by the taxi fleet. The FCD data, which represents the position of cabs using latitude and longitude coordinates, must be map-matched to the roads on which the cabs are driving, in order to infer the traffic conditions of the roads. The map matching is done through an algorithm using a geographical database and topological rules.
- The monitoring of the current traffic conditions, which consumes the map-matched FCD data and infers the traffic conditions of the roads.
- The forecasting of future traffic conditions, based on a model that is trained from historical and real-time map-matched FCD data.

The second pilot is based upon the processing modules developed in the first pilot (cf. D5.2, Section 5), namely the processing modules developed by CERTH to analyze traffic data and classify traffic conditions. The second pilot will also develop the newly added workflow of traffic forecasting and model training, which did not exist during the first pilot cycle.

The data sources available for the pilot are:
- A near-real-time stream of Floating Car Data (FCD) generated by a fleet of 1200 taxis, containing information about the position, speed and direction of the cabs.
- A historical database of recorded FCD data.
- A geographical database with information about the road network in Thessaloniki.

The results of traffic monitoring and traffic forecasting are saved into a database for querying, statistics and visualizations.

5.2 Requirements

Table 7 lists the ingestion, storage, processing and output requirements set by this pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 7 lists only the new requirements.

Table 7: Requirements of the Second SC4 Pilot

R1. The pilot will enable the evaluation of the present and future traffic conditions (e.g. congestion) within temporal windows.
Comment: The map-matched FCD data are used to determine the current traffic condition and to make predictions within different time windows.

R2. The traffic predictions will be saved in a database.
Comment: Traffic conditions and predictions will be used for queries, statistics, evaluation of the quality of the predictions, and visualizations.

R3. The pilot can be started in two configurations: single node (for development and testing) and cluster (production).
Comment: It must be possible to run all the pilot components on one single node for development and testing purposes. The cluster configuration must provide a cluster of any of the components: messaging system (Kafka), processing modules (Flink, Spark, TensorFlow), storage (Postgres).

Figure 4: Architecture of the Second SC4 Pilot

5.3 Architecture

The architecture of the pilot has been designed taking into consideration the data sources (mostly streams), the processing steps needed, and the information that needs to be computed. The pilot will ingest data from a near-real-time FCD data stream from cabs and from historical FCD data. The FCD data needs to be preprocessed for map matching before being used for classification/prediction.

Apache Kafka will be used to distribute the computations, as it provides a scalable, fault-tolerant messaging system. The processing of the data streams will be performed within temporal windows. Apache Flink will be used for the map matching algorithm, in the same manner as in the first cycle of the pilot. Apache Spark or TensorFlow will be considered as the platform to implement the traffic forecasting algorithm.

The algorithms used for the map matching and classification will be provided using R, as it provides good support for machine learning algorithms and because it is commonly used and well known by the researchers at CERTH. In order to use the R packages in a Flink application developed in Java, the pilot will connect to an R server (via Rserve). Recurrent Neural Networks will be used for the traffic forecasting module.

The traffic conditions and prediction computations will be stored in a scalable, fault-tolerant database, such as Elasticsearch. The storage system must support spatial and temporal indexing.
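To give an idea of the ingestion side, the following is a minimal sketch of a Kafka producer that polls an FCD web service and publishes position records to a topic for the downstream Flink map matching job. The service URL, record fields, topic name and polling interval are assumptions rather than the actual Thessaloniki feed:

    # Illustrative sketch of the FCD Kafka producer. URL, fields, topic and
    # polling interval are assumptions; the real feed belongs to the taxi fleet.
    import json
    import time
    import requests
    from kafka import KafkaProducer

    FCD_SERVICE = "http://example.org/fcd/latest"     # hypothetical web service
    producer = KafkaProducer(bootstrap_servers="kafka:9092",
                             value_serializer=lambda v: json.dumps(v).encode("utf-8"))

    while True:
        response = requests.get(FCD_SERVICE, timeout=10)
        response.raise_for_status()
        for record in response.json():
            message = {
                "vehicle_id": record["id"],
                "lat": record["lat"],
                "lon": record["lon"],
                "speed": record["speed"],        # assumed km/h
                "heading": record["heading"],    # assumed degrees
                "timestamp": record["timestamp"],
            }
            producer.send("fcd-raw", message)    # assumed topic read by Flink
        producer.flush()
        time.sleep(60)                           # assumed polling interval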

5.4 Deployment

Table 8 lists the components provided to the pilot as part of BDI and the components that will be developed within WP6 in the context of executing the pilot.

Table 8: Components needed to deploy the Second SC4 Pilot

Module | Task | Responsible
PostGIS, Elasticsearch, Kafka, Flink, Spark, TensorFlow | BDI dockers made available by WP4 | NCSR-D, SWC, TF, FhG
A Kafka producer for the FCD data stream (source: URL) and the historical data (source: file system) | Develop a Kafka producer to collect the FCD data as a stream from web services, and from the file system for the historical data sets, and send them to a Kafka topic | FhG
Kafka brokers | Install Kafka to provide a message broker and the topics | SWC
A Spark application for traffic forecasting and model training | Develop a Spark application that consumes map-matched FCD data from a Kafka topic. The application will train a prediction model and write the traffic predictions to Elasticsearch. | FhG
A Kafka consumer for storing analysis results | Develop a Kafka consumer that stores the results of the traffic classification and prediction module | FhG

6 Second SC5 Pilot Deployment

6.1 Use cases

The pilot is carried out by NCSR-D in the frame of SC5: Climate Action, Environment, Resource Efficiency and Raw Materials.

The pilot demonstrates the following workflow. A (potentially hazardous) substance is released in the atmosphere, which results in increased readings at one or more monitoring stations. The user accesses a user interface provided by the pilot to define the locations of the monitoring stations, as well as a time series of the measured values (e.g. gamma dose rate). The platform initiates:
- a weather matching algorithm, that is, a search for similarity between the current weather and the pre-computed weather patterns, as well as
- a dispersion matching algorithm, that is, a search for similarity of the current substance dispersion patterns with the precomputed ones.

The weather patterns have been extracted in a pre-processing step by clustering weather conditions recorded in the past, while the substance dispersion patterns have been precomputed by simulating different scenarios of substance release and weather conditions. The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request.

The following datasets are involved:
- NetCDF files from the European Centre for Medium-range Weather Forecasting (ECMWF, http://apps.ecmwf.int/datasets)
- GRIB files from the National Oceanic and Atmospheric Administration (NOAA, https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/global-forcast-system-gfs)

The following processing will be carried out:
- The weather clustering algorithm, which creates clusters of similar weather conditions, implemented using the BDI platform (see Section 6.3).
- The WRF downscaling, which takes as input low-resolution weather data and creates high-resolution weather data.
- The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model, which computes dispersion patterns given predominant weather conditions.

The following outputs are made available for visualization or further processing:
- The dispersions produced by DIPCOT.
- The weather clusters produced by the weather clustering algorithm.

6.2 Requirements

Table 9 lists the ingestion, storage, processing and output requirements set by this pilot.

Table 9: Requirements of the Second SC5 Pilot

R1. Provide a means of downloading current/evaluation weather data from ECMWF or alternative services.
Comment: A data connector/interface needs to be developed.

R2. ECMWF and NOAA datasets are compatible with the WRF and DIPCOT naming conventions.
Comment: A preprocessing WPS normalization step will perform the necessary transformations and variable renamings needed to ensure compatibility.

R3. Retrieve NetCDF files from HDFS as input to the weather clustering algorithm.

R4. Dispersion matching will filter on dispersion values.
Comment: A relational database will provide indexes on dispersion values for efficient dispersion search.

R5. Dispersion visualization.
Comment: Weather and dispersion matching must produce output compatible with Sextant's input, or Sextant must be modified to support the new input.
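Requirement R4 essentially amounts to range queries over an indexed table of precomputed dispersion values. A sketch of such a lookup, under a hypothetical Postgres schema (the actual table and column names will be defined by the pilot), could be:

    # Illustrative sketch of requirement R4: filter the precomputed dispersion
    # scenarios by value range at the stations that reported increased readings.
    # Table and column names are hypothetical; the pilot's schema may differ.
    import psycopg2

    conn = psycopg2.connect(host="postgres", dbname="sc5", user="bde", password="bde")

    # One (station, lower, upper) constraint per monitoring station, e.g. from the UI.
    constraints = [("athens-01", 0.8, 1.2), ("lamia-03", 0.4, 0.9)]

    sql = """
    SELECT scenario_id
    FROM dispersion_values
    WHERE station_id = %s AND value BETWEEN %s AND %s
    """

    with conn, conn.cursor() as cur:
        matching = None
        for station_id, low, high in constraints:
            cur.execute(sql, (station_id, low, high))
            scenarios = {row[0] for row in cur.fetchall()}
            # A candidate scenario must satisfy the constraint at every station.
            matching = scenarios if matching is None else matching & scenarios
        print("candidate dispersion scenarios:", sorted(matching or []))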

Figure 5: Architecture of the Second SC5 Pilot

6.3 Architecture

To satisfy the requirements described above, the following components will be deployed.

Storage infrastructure:
- HDFS for storing NetCDF and GRIB files.
- Postgres for storing dispersions.

Processing components:
- scikit-learn or TensorFlow to host the weather clustering algorithm.

Other modules:
- ECMWF and NOAA data connectors.
- WPS normalization procedure.
- WRF downscaling component.
- DIPCOT atmospheric dispersion model.
- Weather and dispersion matching.
- Sextant for visualizing the dispersion layer.
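A first cut of the weather clustering component could look like the sketch below, which flattens a few surface fields from archived NetCDF analyses into feature vectors and clusters them with k-means. The file locations, variable names and number of clusters are assumptions, not the pilot's final configuration:

    # Illustrative sketch of the weather clustering step using netCDF4 and
    # scikit-learn. File paths, variable names and k are assumptions.
    import glob
    import numpy as np
    from netCDF4 import Dataset
    from sklearn.cluster import KMeans

    FIELDS = ["u10", "v10", "t2m"]     # assumed ECMWF surface variables

    def features(path):
        """One feature vector per time step: the chosen fields flattened over the grid."""
        with Dataset(path) as ds:
            per_field = [np.asarray(ds.variables[name][:])
                           .reshape(ds.variables[name].shape[0], -1)
                         for name in FIELDS]
        return np.hstack(per_field)

    samples = np.vstack([features(f) for f in sorted(glob.glob("/data/ecmwf/*.nc"))])
    kmeans = KMeans(n_clusters=8, random_state=0).fit(samples)   # assumed k=8

    np.save("/data/weather-clusters/centroids.npy", kmeans.cluster_centers_)
    print("cluster sizes:", np.bincount(kmeans.labels_))

Weather matching then reduces to assigning the current weather's feature vector to its nearest centroid and retrieving the dispersion scenarios precomputed for that cluster.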


6.4 Deployment

Table 10 lists the components provided to the pilot as part of BDI and the components that will be developed within WP6 in the context of executing the pilot.

Table 10: Components needed to deploy the Second SC5 Pilot

Module | Task | Responsible
HDFS, Sextant, Postgres | BDI dockers made available by WP4 | TF, UoA, NCSR-D
scikit-learn / TensorFlow | To be developed in the pilot | NCSR-D
DIPCOT | To be packaged in the pilot | NCSR-D
Weather clustering algorithm | To be developed in the pilot | NCSR-D
Weather matching | To be developed in the pilot | NCSR-D
Dispersion matching | To be developed in the pilot | NCSR-D
ECMWF and NOAA data connector | To be developed in the pilot | NCSR-D
Data visualization UI | To be developed in the pilot | NCSR-D


7 Second SC6 Pilot Deployment

7.1 Use cases

The pilot is carried out by NCSR-D and SWC in the frame of SC6: Europe in a changing world – Inclusive, innovative and reflective societies.

The pilot demonstrates the following workflow. Municipality economic data (i.e. budget and budget execution data) are ingested on a regular basis (daily, weekly, and so on) from a series of locations and in a variety of structures and formats, and are homogenized so that they can be compared, analyzed and visualized in a comprehensible way. The data is exposed to users via a dashboard that exposes search/discovery, aggregation, analysis, correlation and visualization functionalities over structured data. The results of the data analysis will be stored in the infrastructure, to avoid carrying out the same processing multiple times.

The second cycle of the pilot will extend the first pilot by incorporating different formats, through the development of a modular parsing library.

The following datasets are involved:
- Budget execution data of the Municipality of Athens.
- Budget execution data of the Municipality of Thessaloniki.
- Budget execution data of the Municipality of Barcelona.

The current datasets involved are exposed either as an API or as CSV/XML files.

Datasets will be described by DCAT-AP metadata (cf. https://joinup.ec.europa.eu/asset/dcat_application_profile/description) and the FIBO (cf. http://www.omg.org/spec/EDMC-FIBO/FND/1.0/Beta1/index.htm) and FIGI (cf. http://www.omg.org/hot-topics/finance.htm) ontologies. Statistical data will be described in the RDF Data Cube vocabulary (cf. https://www.w3.org/TR/2014/REC-vocab-data-cube-20140116).

The following processing is carried out:
- Data ingestion and homogenization.
- Aggregation, analysis and correlation over the budget data.

The following outputs are made available for visualization or further processing:
- Structured information extracted from budget datasets, exposed as a SPARQL endpoint.
- Metadata for dataset searching and discovery.
- Aggregation and analysis results.

7.2 Requirements

Table 11 lists the ingestion, storage, processing and output requirements set by this pilot.

Table 11: Requirements of the Second SC6 Pilot

R1. In case of failures during data processing, users should be able to recover data produced earlier in the workflow, along with their lineage and associated metadata.
Comment: Processing modules should periodically store intermediate results. When starting up, processing modules should check at the metadata registry whether intermediate results are available.

R2. Transform budget data into a homogenized format using various parsers.
Comment: Parsers will be developed for the pilot, taking into account R1.

R3. Expose data and metadata through a SPARQL endpoint.
Comment: The triple store should be accessed via a SPARQL endpoint.

R4. Intuitive, easy-to-use interface for searching and selecting relevant data sources. The use of the user interface should be documented, so that users can ease into using it with as little effort as possible.
Comment: The GraphSearch UI will be used to create visualizations from SPARQL queries.

Figure 6: Architecture of the Second SC6 Pilot

7.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures:
- HDFS for storing ingested datasets.
- 4store for storing homogenized statistical data and dataset metadata.

Processing infrastructures:
- Metadata extraction: Spark is used to extract RDF data and metadata from the budget data. These tools will react on Kafka messages.
- PoolParty: a SKOS thesaurus (cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System) will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g. bio terms, locations and other named entities). For this step, the SWC PoolParty Semantic Suite (cf. http://www.poolparty.biz) will be used as an external service. PoolParty is accessible from the BDE components via an HTTP API; the connection between Spark and PoolParty has been implemented in the first pilot cycle. Additional enrichment of the dataset will be explored, e.g. via linking to DBpedia or other LOD sources.
- Data analysis, which will be performed on demand by pre-defined queries in the dashboard.

Other modules:
- Flume for dataset ingestion: for every source that will be ingested into the system, there will be a Flume agent responsible for data ingestion and basic modification/unification.
- Kafka: as soon as a new record is available, a Kafka message will be produced. One Kafka consumer stores the raw data into HDFS.
- A set of pre-defined SPARQL queries that carry out analytical aggregations, important comparisons and/or other analyses of the data.
- GUI that provides functionality for (a) metadata searching, to discover datasets, data and publications, and (b) linked data browsing (i.e. dereferencing entity descriptions in RDF), in the form of a visual dashboard realised in d3.js (cf. https://d3js.org).
- GraphSearch as the user interface.
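One of the pre-defined analytical queries could, for example, total budgeted versus executed amounts per year for a municipality over RDF Data Cube observations, along the lines of the sketch below. The endpoint URL and the property IRIs in the ex: namespace are assumptions; the pilot's actual data model follows DCAT-AP, FIBO/FIGI and the RDF Data Cube vocabulary:

    # Illustrative sketch of a pre-defined analytical SPARQL query issued with
    # SPARQLWrapper. Endpoint and property IRIs (ex:) are assumptions.
    from SPARQLWrapper import JSON, SPARQLWrapper

    ENDPOINT = "http://4store:8080/sparql/"        # assumed BDI 4store endpoint

    QUERY = """
    PREFIX qb: <http://purl.org/linked-data/cube#>
    PREFIX ex: <http://example.org/budget#>
    SELECT ?year (SUM(?budgeted) AS ?total_budgeted) (SUM(?executed) AS ?total_executed)
    WHERE {
      ?obs a qb:Observation ;
           ex:municipality <http://example.org/municipality/Athens> ;
           ex:year ?year ;
           ex:budgetedAmount ?budgeted ;
           ex:executedAmount ?executed .
    }
    GROUP BY ?year
    ORDER BY ?year
    """

    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(QUERY)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["year"]["value"],
              row["total_budgeted"]["value"],
              row["total_executed"]["value"])

The dashboard visualizations in d3.js would consume exactly this kind of JSON result set.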

7.4 Deployment

Table 12 lists the components provided to the pilot as part of BDI and the components that will be developed within WP6 in the context of executing the pilot.

Table 12: Components needed to deploy the Second SC6 Pilot

Module | Task | Responsible
Spark over HDFS, 4store, Flume, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, SWC
Data storage schema | To be extended for the pilot | SWC
Metadata extraction | Parsers for the different data sources will be developed for the pilot | SWC
GraphSearch GUI | To be configured for the pilot | SWC

8 Second SC7 Pilot Deployment

8.1 Use cases

The pilot is carried out by SatCen, UoA and NCSR-D in the frame of SC7: Secure societies – Protecting freedom and security of Europe and its citizens.

The pilot demonstrates the following workflows:

1. Event detection workflow. News sites and social media are monitored and processed in order to extract and localize information about events. Events are categorized and the information from them is extracted; the end-user is notified about the area concerned by the news and can visualize the event information, together with the changes detected by the other workflow (if activated).
2. Change detection workflow. The end user selects a relevant Area of Interest. With respect to the selected dates, two satellite images (earliest and latest) of these areas are downloaded from the ESA Sentinels Scientific Data Hub and processed in order to detect changes. The end-user is notified about detected changes and can view the images and the event information about this area.

The second cycle of the SC7 pilot will extend the functionality and improve the performance of the first cycle of the pilot (cf. D5.2, Section 8).

Apart from the datasets used in the first cycle of the pilot, this cycle will also use the keyword-based Twitter API to retrieve tweets based on pre-defined keywords. To further support the keyword-based search, the second cycle of the pilot will also include a full-text indexing engine.

The following outputs are made available for visualization or further processing:
- Relevant news related to specific keywords, together with the corresponding Area of Interest.
- Detected changes.

Moreover, the event detection workflow will be extended in order to automatically activate the change detection workflow. These changes are depicted in the updated architecture diagram in Figure 7.


8.2 Requirements

Table 13 lists the ingestion, storage, processing and output requirements set by the second cycle of the pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 13 lists only the new requirements.

Table 13: Requirements of the Second SC7 Pilot

R1. Monitor keyword-based text services (Twitter). Text is retrieved and stored together with provenance and any metadata provided by the service (notably location).
Comment: The NOMAD data connectors to Twitter and Reuters will be adapted to access the keyword search API of Twitter and store to Cassandra.

R2. Regularly execute event detection using Spark over the most recent text batch.
Comment: Event detection is part of the ingestion process and adds annotations to the text data; it is not part of the distributed processing.

R3. Improve the speed of the change detection workflow.
Comment: Optimize the scalability of the operators developed in Apache Spark for the change detection workflow.

R4. Extend the change detection workflow to improve accuracy.
Comment: Fundamental SNAP operators (e.g. Subset and Terrain Correction) for Sentinel-1 will be adapted to Apache Spark.

R5. Areas of Interest are automatically defined by event detection.
Comment: The Sentinel data connector is parametrized from the event detection module with a GIS shape.

R6. The end-user interface is based on Sextant.
Comment: Improvement of Sextant functionalities, to improve the user experience.

R7. Users must be authenticated and authorized to access the pilot data.
Comment: Sextant will be extended in order to support authentication and authorization.

Figure 7: Architecture of the Second SC7 Pilot

8.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures:
- HDFS for storing satellite images.
- Cassandra for storing news and tweet content and metadata.
- Lucene for storing the GADM dataset, i.e. the administrative areas together with their geo-locations.
- Strabon for storing the geo-locations of detected changes and location metadata about news and tweets.

Processing infrastructures:
- Spark will be made available for improving the change detection module and developing the event detection module.

Data integration:
- Semagrow will federate Strabon and Cassandra, to provide the user interface with homogeneous access to both data stores.

Other modules:
- Twitter data connector.
- Reuters RSS feed reader.
- The Sentinel Data Aggregator, which receives as input the set of areas of interest and submits a suitable query to the Sentinels Scientific Data Hub.
- Sextant as the user interface.
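For orientation, the keyword-based Twitter connector could follow the pattern sketched below, retrieving tweets for a watch list and writing text plus provenance metadata to Cassandra. The keywords, keyspace, table and credential placeholders are assumptions; the pilot adapts the existing NOMAD connectors rather than this script:

    # Illustrative sketch of the keyword-based Twitter connector, using tweepy
    # and the DataStax cassandra-driver. Keywords, keyspace and table are assumed.
    import json
    import tweepy
    from cassandra.cluster import Cluster

    KEYWORDS = ["flood", "earthquake", "wildfire"]          # assumed watch list

    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
    api = tweepy.API(auth, wait_on_rate_limit=True)

    session = Cluster(["cassandra"]).connect("sc7")          # assumed keyspace
    insert = session.prepare(
        "INSERT INTO tweets (id, created_at, keyword, text, user, coordinates) "
        "VALUES (?, ?, ?, ?, ?, ?)")

    for keyword in KEYWORDS:
        for tweet in tweepy.Cursor(api.search, q=keyword, lang="en").items(200):
            coords = json.dumps(tweet.coordinates) if tweet.coordinates else None
            session.execute(insert, (tweet.id_str, tweet.created_at, keyword,
                                     tweet.text, tweet.user.screen_name, coords))

Event detection then runs periodically over the most recent batch of stored tweets (requirement R2), while Semagrow exposes the Cassandra and Strabon contents to Sextant through a single federated query interface.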

8.4 Deployment

Table 14 lists the components provided to the pilot as part of BDI (cf. https://github.com/big-data-europe/README/wiki/Components) and the components that will be developed within WP6 in the context of executing the pilot.

Table 14: Components needed to deploy the Second SC7 Pilot

Module | Task | Responsible
Big Data Integrator: HDFS/Hadoop, Cassandra, Spark, Semagrow, Strabon, SOLR | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, UoA, SWC
Cassandra and Strabon stores | The schema needs to be altered to support tweets retrieved by keyword | NCSR-D and UoA
Change detection module | Spark code to be developed for extending and improving the change detection algorithm | UoA
Event detection module | Spark code to be developed to scale the event detection algorithm | NCSR-D
Twitter data connector | To be extended to access the keyword search Twitter API | NCSR-D
User interface | To be enhanced for the pilot | UoA


9 Conclusions

This report analysed the pilot requirements and specified the components of the generic Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting round. The relevant work in this task is to ensure that the components are within the scope of what is prepared in WP4, and that they interoperate and can be used in the same application.

All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure and provided to the piloting partners as a basis for their piloting applications, which will be developed in WP6. As a result of this preliminary testing and of the interaction between the technical partners and the piloting partners, some of the original pilot descriptions have been refined and fully specified, and their usage of BDI components has been clarified. This ensures that the pilot descriptions are consistent with the first public release of the BDI platform (D4.2) and can be reproduced by interested third parties.

Work in this task (Task 5.2) will proceed as follows:
- During the second pilot deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.
- During the third pilot deployment phase, work in this task will prepare the next version of this document, regarding the BDI instances needed for the third piloting round.

Page 3: Data Technologies · consulting about the platform. This task includes two phases: the design and the deployment phase. The design phase involves the following. Review the pilot descriptions

D54 ndash v 100

Page

3

Author List

Organisation Name Contact Information

NCSR-D Ioannis Mouchakis gmouchakisiitdemokritosgr

NCSR-D Stasinos Konstantopoulos konstantiitdemokritosgr

NCSR-D Angelos Charalambidis acharaliitdemokritosgr

NCSR-D Georgios Stavrinos gstavrinosiitdemokritosgr

NCSR-D Vangelis Karkaletsis vangelisiitdemokritosgr

Agroknow Pythagoras Karampiperis pythkagroknowcom

Agroknow Panagiotis Zervas pzervasagroknowcom

Open

PHACTS Bryn Williams-Jones brynopenphactsfoundationorg

Open

PHACTS Kiera McNeice kieraopenphactsfoundationorg

CRES Fragkiskos Mouzakis mouzakiscresgr

TenForce Aad Versteden aadverstedentenforcecom

UoB Hajira Jabeen hajirajabeengmailcom

SWC Juumlrgen Jakobitsch jjakobitschsemantic-webat

D54 ndash v 100

Page

4

Executive Summary

This report documents the instantiations of the Big Data Integrator Platform that underlies the

pilot applications that will be prepared in WP6 for serving exemplary use cases of the Horizon

2020 Societal Challenges These platform instances will be provided to the relevant networking

partners to be used for executing the pilot sessions foreseen in WP6

For each of the seven pilots this document provides (a) a brief summary of the pilot description

prepared in WP6 and especially of the use cases provided in the pilot descriptions (b) the

technical requirements for carrying out these use cases (c) an architecture that shows the BDI

components required to cover these requirements and (d) the list of components in the

architecture and their status (available as part of BDI or otherwise available or to be

developed as part of the pilot)

D54 ndash v 100

Page

5

Abbreviations and Acronyms

BDI

The Big Data Integrator platform that is developed within Big Data Europe

The components that are made available to the pilots by BDI are listed

here httpsgithubcombig-data-europeREADMEwikiComponents

BDI

Instance

A specific deployment of BDI complemented by tools specifically

supporting a given Big Data Europe pilot

BT Bluetooth

ECMWF European Centre for Medium range Weather Forecasting

ESGF Earth System Grid Federation

FCD Floating Car Data

LOD Linked Open Data

SC1 Societal Challenge 1 Health Demographic Change and Wellbeing

SC2 Societal Challenge 2 Food Security Sustainable Agriculture and Forestry

Marine Maritime and Inland Water Research and the Bioeconomy

SC3 Societal Challenge 3 Secure Clean and Efficient Energy

SC4 Societal Challenge 4 Smart Green and Integrated Transport

SC5 Societal Challenge 5 Climate Action Environment Resource Efficiency

and Raw Materials

SC6 Societal Challenge 6 Europe in a changing world ndash Inclusive innovative

and reflective societies

SC7 Societal Challenge 7 Secure societies ndash Protecting freedom and security

of Europe and its citizens

AK Agroknow Belgium

CERTH Centre for Research and Technology Greece

CRES Center for Renewable Energy Sources and Saving Greece

FAO Food and Agriculture Organization of the United Nations Italy

FhG Fraunhofer IAIS Germany

InfAI Institute for Applied Informatics Germany

NCSR-D National Center for Scientific Research ldquoDemokritosrdquo Greece

OPF Open PHACTS Foundation UK

SWC Semantic Web Company Austria

UoA National and Kapodistrian University of Athens

VU Vrije Universiteit Amsterdam the Netherlands

D54 ndash v 100

Page

6

Table of Contents 1 Introduction 9

11 Purpose and Scope 9

12 Methodology 9

2 Second SC1 Pilot Deployment 10

21 Use Cases 10

22 Requirements 10

23 Architecture 12

24 Deployment 12

3 Second SC2 Pilot Deployment 14

31 Overview 14

32 Requirements 15

33 Architecture 17

34 Deployment 18

4 Second SC3 Pilot Deployment 20

41 Overview 20

42 Requirements 21

43 Architecture 22

44 Deployment 23

5 Second SC4 Pilot Deployment 24

51 Use cases 24

52 Requirements 24

53 Architecture 26

54 Deployment 27

6 Second SC5 Pilot Deployment 28

61 Use cases 28

62 Requirements 29

63 Architecture 30

64 Deployment 31

7 Second SC6 Pilot Deployment 32

D54 ndash v 100

Page

7

71 Use cases 32

72 Requirements 33

73 Architecture 34

74 Deployment 35

8 Second SC7 Pilot Deployment 36

81 Use cases 36

82 Requirements 37

83 Architecture 38

84 Deployment 39

9 Conclusions 41

List of Tables

Table 1 Requirements of the Second SC1 Pilot 11

Table 2 Components needed to Deploy Second SC1 Pilot 13

Table 3 Requirements of the Second SC2 Pilot 16

Table 4 Components needed to deploy the Second SC2 Pilot 19

Table 5 Requirements of the Second SC3 Pilot 21

Table 6 Components needed to deploy the Second SC3 Pilot 23

Table 7 Requirements of the Second SC4 Pilot 25

Table 8 Components needed to deploy the Second SC4 Pilot 28

Table 9 Requirements of the Second SC5 Pilot 29

Table 10 Components needed to deploy the Second SC5 Pilot 31

Table 11 Requirements of the Second SC6 Pilot 33

Table 12 Components needed to deploy the Second SC6 Pilot 36

Table 13 Requirements of the Second SC7 Pilot 38

Table 14 Components needed to deploy the Second SC7 Pilot 40

D54 ndash v 100

Page

8

List of Figures

Figure 1 Architecture of the Second SC1 Pilot 12

Figure 2 Architecture of the Second SC2 Pilot 17

Figure 3 Architecture of the Second SC3 Pilot 22

Figure 4 Architecture of the Second SC4 Pilot 26

Figure 5 Architecture of the Second SC5 Pilot 30

Figure 6 Architecture of the Second SC6 Pilot 34

Figure 7 Architecture of the Second SC7 Pilot 38

D54 ndash v 100

Page

9

1 Introduction

11 Purpose and Scope

This report documents the instantiations of the Big Data Integrator Platform (BDI) for serving

the needs of the domains examined within Big Data Europe These platform instances will be

provided to the relevant networking partners to execute the pilots foreseen in WP6

12 Methodology

Task 52 focuses on the application of the generic Instantiation methodology in a specific Use

Case pertaining to domains closely related to Europersquos Social challenges To this end T52

comprises seven (7) distinct sub-tasks each one dedicated to a different domain of application

Participating partners and their role NCSR-D (task leader) deploys the different instantiations

of the Big Data Integrator Platform and supports the partners carrying out each pilot with

consulting about the platform This task includes two phases the design and the deployment

phase The design phase involves the following

Review the pilot descriptions prepared in WP6 and request clarifications where needed

in order to prepare a detailed technical description of the platform that will support the

pilot

Prepare a first draft of the sections for the second cycle pilots where use cases and

workflow from the pilot descriptions are summarized and technical requirements and

an architecture for each pilot-specific platform is drafted

Cooperate with the persons responsible for each pilot to update the pilot description

and the technical description in this deliverable so that they are consistent and

satisfactory This draft also includes a list of components and their availability (a) base

platform components that are prepared in WP4 (b) pilot-specific components that are

already available or (c) pilot-specific components that will be developed for the pilot

Components are also assigned a partner responsible for their implementation

Review the pilot technical descriptions from the perspective of bridging between

technical work and the community requirements to establish that the pilot is relevant

to the communities it is aimed at

During deployment phase work in this task will follow and document development of the

individual components and test their integration into the platform

D54 ndash v 100

Page

10

2 Second SC1 Pilot Deployment

21 Use Cases

The pilot is carried out by OPF and VU in the frame of SC1 Health Demographic Change and

Wellbeing

The pilot demonstrates the workflow of reproducing the functionality of an existing data

integration and processing system (the Open PHACTS Discovery Platform) on BDI The

second pilot extends the first pilot (cf D52 Section 2) with the following

Discussions with stakeholders and other Societal Challenges will identify how the

existing Open PHACTS platform and datasets may potentially be used to answer

queries in other domains In particular applications in Societal Challenge 2 (food

security and sustainable agriculture) where the effects of chemistry (eg pesticides)

on biology are probed in plants could exploit the linked data services currently within

the OPF platform This will require discussing use case specifics with SC2 to

understand their requirements and ensure that the OPF data is applicable Similarly

we will explore whether SC2 data could be linked to the OPF data platform is relevant

for early biology research

No specific new datasets are targeted for integration in the second pilot However if

datasets to be made available through other pilots have clear potential links to Open

PHACTS datasets these will be considered for integration into the platform to offer

researchers the ability to pose more complex queries across a wider range of data

The second pilot will aim to expand on first pilot by refreshing the datasets integrated

into the pilot Homogenising and integrating the new data available for these datasets

and developing ways to update datasets by integrating new data on an ongoing basis

will enable new use cases where researchers require fully current datasets for their

queries

The second pilot will also simplify existing workflows for querying the API for example

with components for common software tools such as KNIME reducing the barrier for

academic institutions and companies to access the platform for knowledge- and data-

driven biomedical research use cases

22 Requirements

Table 1 lists the ingestion storage processing and output requirements set by this pilot

Table 1 Requirements of the Second SC1 Pilot

D54 ndash v 100

Page

11

Requirement Comment

R1 The solution should be

packaged in a way such that it is

possible to combine the Open

PHACTS Docker and the BDE

platform to achieve a custom

integrated solution

Specificities of the services of the Open PHACTS

Discovery Platform should not be hard-wired into

the domain-specific instance but should be read

from a configuration file (such as SWAGGER)

The BDE instance should offer or apply these

external services over data hosted by the BDE

instance

R2 RDF data storage The current Open PHACTS Discovery Platform is

based on distributed Virtuoso a proprietary

solution The BDE platform will provide a

distributed 4store and SANSA to be compared

with the Open PHACTS Discovery Platform

R3 Datasets are aligned and linked

at data ingestion time and the

transformed data is stored

In conjunction with R1 a modular data ingestion

component should dynamically decide which data

transformers to invoke

R4 Data and query security and

privacy requirements

A BDI local deployment holds private data and

serves private queries BDE does not foresee any

specific technical support for query obfuscation

so remote data sources need to be cloned locally

to guarantee query privacy


Figure 1 Architecture of the Second SC1 Pilot


23 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

Distributed triple store for the data. The second pilot cycle will also test the feasibility of using the SANSA stack1 as an alternative for SPARQL query processing

Processing infrastructures

Scientific Lenses query expansion

Other modules

Data connector including the data transformation modules for the alignment of data at

ingestion time

REST API for querying that builds a SPARQL query by using keywords to fill in pre-defined query templates. The querying service also uses Scientific Lenses to expand queries
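As a rough illustration of this keyword-driven query service, the following sketch fills a pre-defined SPARQL template with a keyword and forwards it to the pilot's SPARQL endpoint. The endpoint URL and the template itself are illustrative assumptions, not the actual Open PHACTS templates or services.

```python
import requests

# Hypothetical endpoint of the 4store/SANSA SPARQL service deployed for the pilot.
SPARQL_ENDPOINT = "http://localhost:8890/sparql"

# Illustrative pre-defined template; the real templates are configured externally
# (e.g. via a Swagger description), not hard-wired as done here.
COMPOUND_TEMPLATE = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?compound ?label WHERE {{
    ?compound rdfs:label ?label .
    FILTER(CONTAINS(LCASE(?label), LCASE("{keyword}")))
}} LIMIT 100
"""

def query_by_keyword(keyword: str) -> list:
    """Fill the template with a keyword and return SPARQL results as JSON bindings."""
    query = COMPOUND_TEMPLATE.format(keyword=keyword)
    response = requests.get(
        SPARQL_ENDPOINT,
        params={"query": query},
        headers={"Accept": "application/sparql-results+json"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["results"]["bindings"]

if __name__ == "__main__":
    for row in query_by_keyword("aspirin"):
        print(row["compound"]["value"], row["label"]["value"])
```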

24 Deployment

Table 2 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

1 http://sansa-stack.net


Table 2 Components needed to Deploy Second SC1 Pilot

Module Task Responsible

4store BDI dockers made available by WP4 NCSR-D

SANSA stack BDI dockers made available by WP4 FhGUniBonn

Data connector and

transformation modules

Develop a dynamic transformation

engine that uses SWAGGER

descriptions to select the appropriate

transformer

VU

Query endpoint Develop a dynamic query re-write

engine that uses SWAGGER

descriptions to select the transformer

VU

Scientific Lenses query

expansion module

Needs to be deployed and tested

unless an existing live service will be

used for the BDE pilot

VU


3 Second SC2 Pilot Deployment

31 Overview

The pilot is carried out by AK FAO and SWC in the frame of SC2 Food Security Sustainable

Agriculture and Forestry Marine Maritime and Inland Water Research and the Bioeconomy

The second pilot cycle builds upon the first pilot cycle (cf D51 Section 3) expanding the

relevant data sources and extending the data processing needed to handle a variety of data

types (apart from bibliographic data) relevant to Viticulture

The pilot demonstrates the following workflows

1 Text mining workflow Automatically annotating scientific publications by (a) extracting

named entities (locations domain terms) and (b) extracting the captions of images

figures and tables The extracted information is provided to viticultural researchers via

a GUI that exposes search functionality

2 Data processing workflow The end users (viticultural researchers) upload scientific

data in a variety of formats and provide the metadata needed in order to correctly

interpret the data The data is ingested and homogenized so that it can be compared

and connected with other relevant data originally in diverse formats The data is

exposed to viticultural researchers via a GUI that exposes search/discovery, aggregation, analysis, correlation, and visualization functionalities over structured data

The results of the data analysis will be stored in the infrastructure to avoid carrying out

the same processing multiple times, with appropriate provenance for future reference, publication, and scientific replication

3 Phenologic modeling workflow, that is, the scheduling of agricultural operations (e.g. pruning, harvesting, etc.) by cross-examining the weather data observed in the area of the vineyard with the weather conditions needed for these operations (a simple sketch of this check is given after this list)

4 Variety identification workflow The end users complete an on-spot questionnaire

regarding the characteristics of a specific grape variety Together with the geolocation

of the questionnaire this information is used to identify a grape variety
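To illustrate the phenologic modeling workflow of item 3 above, the following sketch checks observed weather against per-operation condition windows. The operations, thresholds, and data layout are illustrative assumptions only, not the actual AK VITIS phenologic model.

```python
from datetime import date

# Illustrative operation constraints (not the actual AK VITIS model): each
# agricultural operation is assumed to need temperature and rainfall within a window.
OPERATION_WINDOWS = {
    "pruning":    {"temp_min": 5.0,  "temp_max": 18.0, "max_rain_mm": 2.0},
    "harvesting": {"temp_min": 12.0, "temp_max": 30.0, "max_rain_mm": 0.5},
}

def suitable_days(observations, operation):
    """Return the days whose observed weather satisfies the operation's window.

    `observations` is assumed to be a list of dicts with keys
    'day' (date), 'temp_c' (float) and 'rain_mm' (float).
    """
    w = OPERATION_WINDOWS[operation]
    return [
        obs["day"]
        for obs in observations
        if w["temp_min"] <= obs["temp_c"] <= w["temp_max"]
        and obs["rain_mm"] <= w["max_rain_mm"]
    ]

if __name__ == "__main__":
    weather = [
        {"day": date(2017, 2, 20), "temp_c": 9.5,  "rain_mm": 0.0},
        {"day": date(2017, 2, 21), "temp_c": 21.0, "rain_mm": 4.2},
    ]
    print(suitable_days(weather, "pruning"))   # -> only the first day qualifies
```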

The following datasets will be involved

The AGRIS and PubMed datasets that include scientific publications

Weather data available via publicly available APIs such as AccuWeather, OpenWeatherMap, and Weather Underground


User-generated data such as geotagged photos from leaves young shoots and grape

clusters ampelographic data SSR-marker data that will be provided by the VITIS

application

OIV Descriptor List2 for Grape Varieties and Vitis species

Crop Ontology

The following processing is carried out

Named entity extraction

Researcher affiliation extraction and verification

Variety identification

Phenologic modelling

PDF structure processing to associate tables and diagrams with captions

Data ingestion and homogenization

Aggregation analysis correlation over scientific data

The following outputs are made available for visualization or further processing

Structured information topics extracted from scientific publications

Metadata for dataset searching and discovery

Aggregation analysis correlation results

32 Requirements

Table 3 lists the ingestion storage processing and output requirements set by this pilot

Table 3 Requirements of the Second SC2 Pilot

Requirement Comment

R1 In case of failures during data

processing users should be able to

recover data produced earlier in the

workflow along with their lineage and

associated metadata

Processing modules should periodically

store intermediate results and their lineage

metadata When starting up processing

modules should check at the metadata

registry if intermediate results are available

R2 Extracting images and their captions

from scientific publications

To be developed for the pilot taking into

account R1

2 http://www.oiv.int/en


R3 Extracting thematic annotations from

text in scientific publications

To be developed for the pilot taking into

account R1

R4 Extracting researcher affiliations from

the scientific publications

To be developed for the pilot taking into

account R1

R5 Variety identification To be developed for the pilot taking into

account R1

R6 Phenologic modeling To be developed for the pilot taking into

account R1

R7 Expose data and metadata in JSON

through a Web API

Data ingestion module should write JSON

documents in HDFS 4store should be

accessed via a SPARQL endpoint that

responds with results in JSON


Figure 2 Architecture of the Second SC2 Pilot


33 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing publication full-text and ingested datasets

A graph database for storing publication metadata (terms and named entities)

affiliation metadata (connections between researchers) weather metadata and VITIS

metadata

Processing infrastructures

Metadata extraction: Spark or UnifiedViews3 will be used to extract RDF metadata from publication full-text. These tools will react to Kafka messages; both Spark and UnifiedViews will be evaluated for this task

3 Cf. http://www.unifiedviews.eu


PoolParty: A SKOS Thesaurus4 will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g. bio terms, locations, and other named entities)

For this step the SWC PoolParty Semantic Suite5 will be used Additional enrichment

of the dataset will be explored eg via linking to DBpedia or other LOD sources

AKSTEM the process of discovering relations and associations between organizations

and people in the field of viticulture research

Phenologic Modeling: the algorithm already developed in AK VITIS will be adapted to work in the context of an Apache Spark application

Variety Identification already developed in AK VITIS will be adapted to work in the

context of an Apache Spark application

Extraction of images and figures and their captions from publication PDFs

Data analysis which writes analysis results back into the infrastructure to be retrieved

for visualization Data analysis should accompany each write-back with appropriate

metadata that specify the processing lineage of the derived dataset Intermediate

results should also be written out (and described as such in the metadata) in order to

allow resuming processing after a failure

Other modules

Flume for publication ingestion For every source that will be ingested into the system

there will be a flume agent responsible for data ingestion and basic

modificationunification

Kafka: as soon as a new record is available, a Kafka message will be produced. One Kafka consumer stores raw data into HDFS
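A minimal sketch of such a consumer is given below, assuming the kafka-python client and the WebHDFS-based hdfs client; the topic name, service addresses, and HDFS paths are placeholders, not the actual pilot configuration.

```python
import json
from kafka import KafkaConsumer          # kafka-python
from hdfs import InsecureClient          # WebHDFS client

# Hypothetical service addresses and topic name for the ingestion pipeline.
KAFKA_BOOTSTRAP = "kafka:9092"
HDFS_NAMENODE = "http://namenode:50070"
TOPIC = "sc2-ingested-records"

hdfs_client = InsecureClient(HDFS_NAMENODE, user="bde")

consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=KAFKA_BOOTSTRAP,
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
    group_id="raw-to-hdfs",
)

# Store every raw record under a per-record path in HDFS; a real deployment would
# batch records into larger files to avoid the small-files problem.
for message in consumer:
    record = message.value
    path = f"/data/sc2/raw/{message.topic}/{message.partition}-{message.offset}.json"
    with hdfs_client.write(path, encoding="utf-8") as writer:
        json.dump(record, writer)
```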

34 Deployment

Table 4 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 4 Components needed to deploy the Second SC2 Pilot

Module Task Responsible

Spark over HDFS Flume

Kafka

BDI dockers made available by WP4 FH TF InfAI

SWC

4 Cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System 5 Cf. http://www.poolparty.biz


GraphDB andor Neo4j

dockerization

To be investigated whether the Docker images provided by the official systems6 are suitable for the pilot. If not, they will be adapted for the pilot, or an already dockerized triple store such as Virtuoso or 4store will be used

SWC

Flume agents for publication

ingestion and processing

To be developed for the pilot SWC

Flume agents for data

ingestion

To be extended for the pilot in order to

support the introduced datasets

(accuweather data user-generated

data)

SWC AK

Data storage schema To be developed for the pilot SWC AK

Phenologic modelling To be adapted from AK VITIS for the

pilot

AK

Spark AKSTEM To be adapted from AK STEM for the

pilot

AK

Variety Identification To be adapted from AK VITIS for the

pilot

AK


6 https://neo4j.com/developer/docker


4 Second SC3 Pilot Deployment

41 Overview

The pilot is carried out by CRES in the frame of SC3 Secure Clean and Efficient Energy

The second pilot cycle extends the first pilot by adding additional online and offline data

analysis on raw data regarding Acoustic Emissions (AE) sensors and aggregated data such

as parametrics from continuous monitoring systems (CMS) The pilot demonstrates the

following workflow a developer in the field of wind energy enhances condition monitoring for

each unit in a wind farm by pooling together data from multiple units from the same farm (to

consider the cluster operation in total) and third party data (to perform correlated assessment)

The custom analysis modules created by the developer use both raw data that is transferred offline to the processing cluster and condensed data that is streamed online in the same time order in which the events occur

The following datasets are involved

Raw sensor and SCADA data from a given wind farm

Online stream data comprised of parametrics and statistics extracted from the raw

SCADA data

Raw sensor data from Acoustic Emissions module from a given wind farm

All data is in custom binary or ASCII formats ASCII files contain a metadata header and in

tabulated form the signal data (signal in columns time sequence in rows) All data is annotated

by location time and system id

The following processing is carried out

Near-real-time execution of parametrized models to return operational statistics and warnings, including correlation analysis of data across units

Weekly execution of operational statistics

Weekly execution of model parametrization

Weekly specific acoustic emissions DSP

The following outputs are made available for visualization or further processing

Operational statistics near-real time and weekly

Model parameters


42 Requirements

Table 5 lists the ingestion storage processing and output requirements set by this pilot Since

the second cycle of the pilot extends the first pilot some requirements are identical and

therefore omitted from Table 5

Table 5 Requirements of Second SC3 Pilot

Requirement Comment

R1 The online data will be sent (via

OPC) from the intermediate

(local) processing level to BDI

A data connector must be developed that provides

for receiving OPC streams from an OPC-

compatible server

R2 The application should be able

to recover from short outages by

collecting the data transmitted

during the outage from the data

sources

An OPC data connector must be developed that

can retrieve the missing data collected at the

intermediate level from the distributed data

historian systems

R3 Near-realtime execution of

parametrized models to return

operational statistics including

correlation analysis of data

across units

The analysis software should write its results back

into a specified format and data model that is

appropriate input for further analysis

R4 The GUI supports database

querying and data visualization

for the analytics results

The GUI will be able to access files in the specified format and data model


Figure 3 Architecture of the Second SC3 Pilot


43 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS that stores binary blobs each holding a temporal slice of the complete data The

slicing parameters are fixed and can be applied at data ingestion time

A Postgres relational database to store the warnings, operational statistics, and the output of the analysis. The schema will be defined at a later stage
A Kafka broker that will distribute the continuous stream of CMS data to the model execution modules

Processing infrastructures


A processor that operates upon temporal slices of data

A Spark module that orchestrates the application of the processor on slices

A Spark streaming module that operates on the online data
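A minimal sketch of how the Spark module could orchestrate the slice processor over the binary blobs stored in HDFS is given below; the HDFS layout, the derived statistics, and the write-back target are assumptions, and the real processor would decode the pilot's custom binary/ASCII formats.

```python
from pyspark import SparkContext

# Hypothetical HDFS layout: one binary blob per temporal slice of raw sensor data.
SLICES_PATH = "hdfs://namenode:8020/data/sc3/slices/*.bin"

def process_slice(payload: bytes) -> dict:
    """Placeholder for the slice processor; here it only derives a trivial statistic.

    The real processor would decode the custom binary/ASCII format and run the
    parametrized condition-monitoring models.
    """
    return {"size_bytes": len(payload)}

if __name__ == "__main__":
    sc = SparkContext(appName="sc3-slice-processing")
    # binaryFiles yields (path, bytes) pairs, one per blob, distributed over the cluster.
    slices = sc.binaryFiles(SLICES_PATH)
    stats = slices.mapValues(process_slice).collect()
    for path, result in stats:
        print(path, result)
    # In the pilot, results would instead be written to the Postgres statistics tables.
    sc.stop()
```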

Other modules

A data connector that offers an ingestion endpoint andor can retrieve from remote data

sources using the FTP protocol

A data connector that offers an ingestion endpoint that can retrieve an online stream using the OPC protocol and publish it to a Kafka topic (see the sketch after this list)

Data visualization that can visualize the data files stored in HDFS
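The following sketch illustrates one possible shape of the OPC-to-Kafka connector, assuming an OPC UA server at the intermediate level and the python-opcua and kafka-python libraries; the endpoint, node id, and topic name are placeholders, and a production connector would use OPC subscriptions rather than polling.

```python
import json
import time
from opcua import Client as OpcUaClient   # python-opcua
from kafka import KafkaProducer            # kafka-python

# Hypothetical OPC UA endpoint of the intermediate (local) processing level,
# the node holding the condensed CMS parametrics, and the target Kafka topic.
OPC_ENDPOINT = "opc.tcp://windfarm-gateway:4840"
CMS_NODE_ID = "ns=2;s=CMS.Parametrics"
KAFKA_TOPIC = "sc3-cms-stream"

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

opc = OpcUaClient(OPC_ENDPOINT)
opc.connect()
try:
    node = opc.get_node(CMS_NODE_ID)
    while True:
        # Poll the condensed CMS values and forward them to Kafka.
        value = node.get_value()
        producer.send(KAFKA_TOPIC, {"ts": time.time(), "value": value})
        time.sleep(1.0)
finally:
    opc.disconnect()
    producer.flush()
```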

44 Deployment

Table 6 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 6 Components needed to deploy the Second SC3 Pilot

Module Task Responsible

Spark HDFS Postgres

Kafka

BDI dockers made available by WP4 FH TF InfAI

NCSR-D SWC

Acoustic Emissions DSP To be developed for the pilot CRES

OPC Data connector To be developed for the pilot CRES

Data visualization To be extended for the pilot CRES


5 Second SC4 Pilot Deployment

51 Use cases

The pilot is carried out by FhG and CERTH in the frame of SC4 Smart Green and Integrated

Transport

The pilot demonstrates how to implement the workflow for ingesting processing and storing

stream and historical traffic data in a distributed environment The pilot demonstrates the

following workflows

The map matching of the Floating Car Data (FCD) stream that is generated by the taxi

fleet The FCD data that represents the position of cabs using latitude and longitude

coordinates must be map matched to the roads on which the cabs are driving in order

to infer the traffic conditions of the roads The map matching is done through an

algorithm using a geographical database and topological rules

The monitoring of the current traffic conditions that consumes the mapped FCD data

and infers the traffic conditions of the roads

The forecasting of future traffic conditions based on a model that is trained from

historical and real-time mapped FCD data

The second pilot is based upon the processing modules developed in the first pilot (cf D52 Section 5), namely the processing modules developed by CERTH to analyze traffic data and classify traffic conditions. The second pilot will also develop the newly added workflow of traffic forecasting and model training that did not exist during the first pilot cycle

The data sources available for the pilot are

A near-real-time stream of Floating Car Data (FCD) generated by a fleet of 1200 taxis, containing information about the position, speed, and direction of the cabs

A historical database of recorded FCD data

A geographical database with information about the road network in Thessaloniki

The results of traffic monitoring and traffic forecasting are saved into a database for querying

statistics and visualizations

52 Requirements

Table 7 lists the ingestion storage processing and output requirements set by this pilot Since

the present pilot cycle is an extension of the first pilot the requirements of the first pilot also

apply Table 13 lists only the new requirements


Table 7 Requirements of the Second SC4 Pilot

Requirement Comment

R1 The pilot will enable the

evaluation of the present and

future traffic conditions (eg

congestion) within temporal

windows

The FCD map matched data are used to determine

the current traffic condition and to make predictions

within different time windows

R2 The traffic predictions will be

saved in a database

Traffic condition and prediction will be used for

queries statistics evaluation of the quality of

predictions visualizations

R3 The pilot can be started in two

configurations single node (for

development and testing) and

cluster (production)

It must be possible to run all the pilot components on one single node for development and testing purposes. The cluster configuration must provide clustering of all components: messaging system (Kafka), processing modules (Flink, Spark, TensorFlow), and storage (Postgres)


Figure 4 Architecture of the Second SC4 Pilot


53 Architecture

The architecture of the pilot has been designed taking into consideration the data sources

mostly streams the processing steps needed and the information that needs to be computed

The pilot will ingest data from a near real-time FCD data stream from cabs and from historical

FCD data The FCD data needs to be preprocessed for map matching before being used for

classificationprediction

Apache Kafka will be used to distribute the computations as it provides a scalable fault

tolerant messaging system The processing of the data streams will be performed within

temporal windows Apache Flink will be used for the map matching algorithm in the same

manner as in the first cycle of the pilot Apache Spark or Tensorflow will be considered as a

platform to implement the traffic forecasting algorithm
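A minimal sketch of this windowed stream processing is given below, using Spark Structured Streaming to read map-matched FCD records from Kafka and compute average speed per road segment over 5-minute windows; the topic name, JSON schema, and window length are assumptions, not the pilot's actual configuration, and the pilot would write results to Elasticsearch/Postgres rather than the console.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, from_json, window
from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

# Assumed JSON layout of a map-matched FCD record; the real schema is defined by the pilot.
fcd_schema = StructType([
    StructField("road_id", StringType()),
    StructField("speed_kmh", DoubleType()),
    StructField("timestamp", TimestampType()),
])

spark = SparkSession.builder.appName("sc4-traffic-windows").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "kafka:9092")
       .option("subscribe", "fcd-mapmatched")      # hypothetical topic name
       .load())

fcd = raw.select(from_json(col("value").cast("string"), fcd_schema).alias("r")).select("r.*")

# Average speed per road segment over 5-minute windows, a simple proxy for congestion.
traffic = (fcd
           .withWatermark("timestamp", "10 minutes")
           .groupBy(window(col("timestamp"), "5 minutes"), col("road_id"))
           .agg(avg("speed_kmh").alias("avg_speed_kmh")))

query = (traffic.writeStream
         .outputMode("update")
         .format("console")     # stand-in sink for this sketch
         .start())
query.awaitTermination()
```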

The algorithms used for the map matching and classification will be provided using R as

it provides a good support for machine learning algorithms and because it is commonly used


and well known by researchers at CERTH In order to use the R packages in a Flink application

developed in Java the pilot will connect to R server (via Rserve) Recurrent Neural Networks

will be used for the traffic forecasting module

The computed traffic conditions and predictions will be stored in a scalable, fault-tolerant database such as Elasticsearch. The storage system must support spatial and temporal

indexing

54 Deployment

Table 8 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 8 Components needed to deploy Second SC4 Pilot

Module Task Responsible

PostGIS Elasticsearch

Kafka Flink Spark

TensorFlow

BDI dockers made available by WP4 NCSR-D SWC

TF FhG

A Kafka producer for FCD

data stream (source URL)

and historical data (source

file system)

Develop a Kafka producer to collect

the FCD data as a stream from web

services and from the file system for

the historical data sets and send them

to a Kafka topic

FhG

Kafka brokers Install Kafka to provide a message

broker and the topics

SWC

A Spark application for traffic

forecasting and model

training

Develop a Spark application that

consumes FCD matched data from a

Kafka topic The application will train a

prediction model and write the traffic

predictions to ElasticSearch

FhG


A Kafka consumer for storing

analysis results

Develop a Kafka consumer that stores

the result of the Traffic Classification

and prediction module

FhG


6 Second SC5 Pilot Deployment

61 Use cases

The pilot is carried out by NCSR-D in the frame of SC5 Climate Action Environment Resource

Efficiency and Raw Materials

The pilot demonstrates the following workflow. A (potentially hazardous) substance is released into the atmosphere, resulting in increased readings at one or more monitoring stations. The user accesses a user interface provided by the pilot to define the locations of the monitoring stations as well as a timeseries of the measured values (e.g. gamma dose rate). The platform initiates:
a weather matching algorithm, that is, a search for similarity between the current weather and the pre-computed weather patterns, as well as
a dispersion matching algorithm, that is, a search for similarity between the current substance dispersion and the precomputed dispersion patterns

The weather patterns have been extracted in a pre-processing step by clustering weather

conditions recorded in the past while the substance dispersion patterns have been

precomputed by simulating different scenarios of substance release and weather conditions

The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request

The following datasets are involved

NetCDF files from the European Centre for Medium range Weather Forecasting

(ECMWF7)

GRIB files from National Oceanic and Atmospheric Administration (NOAA8)

The following processing will be carried out

The weather clustering algorithm that creates clusters of similar weather conditions

implemented using the BDI platform (see Section 63)

7 http://apps.ecmwf.int/datasets 8 https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/global-forcast-system-gfs


The WRF downscaling, which takes as input low-resolution weather data and creates high-resolution weather data

The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model

computes dispersion patterns given predominant weather conditions

The following outputs are made available for visualization or further processing

The dispersions produced by DIPCOT

The weather clusters produced by the weather clustering algorithm

62 Requirements

Table 9 lists the ingestion storage processing and output requirements set by this pilot

Table 9 Requirements of Second SC5 Pilot

Requirement Comment

R1 Provide a means of downloading current/evaluation weather data from ECMWF or alternative services
A data connector/interface needs to be developed

R2 ECMWF and NOAA datasets are

compatible with the WRF and

DIPCOT naming conventions

A preprocessing WPS normalization step will perform the necessary transformations and variable renamings needed to ensure compatibility

R3 Retrieve NetCDF files from HDFS

as input to the weather clustering

algorithm

R4 Dispersion matching will filter on

dispersion values

Relational database will provide indexes on

dispersion values for efficient dispersion search

R5 Dispersion visualization Weather and dispersion matching must produce output compatible with Sextant's input, or Sextant must be modified to support the new input


Figure 5 Architecture of the Second SC5 Pilot


63 Architecture

To satisfy the requirements described above the following components will be deployed

Storage infrastructure

HDFS for storing NetCDF and GRIB files

Postgres for storing dispersions

Processing components

scikit-learn or TensorFlow to host the weather clustering algorithm (see the sketch after this list)

Other modules

ECMWF and NOAA data connectors

WPS normalization procedure

WRF downscaling component

DIPCOT atmospheric dispersion model

Weather and dispersion matching

Sextant for visualizing the dispersion layer
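A minimal sketch of the weather clustering step is given below, using scikit-learn's KMeans. It assumes the weather fields have already been extracted from the ECMWF NetCDF files into per-day feature vectors; the feature layout and the number of clusters are illustrative assumptions, and random data stands in for the real vectors.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Assume each row is one historical day, flattened into a feature vector
# (e.g. wind components and temperature on a coarse grid) extracted beforehand
# from the NetCDF files; random data is used here as a stand-in.
rng = np.random.default_rng(0)
daily_features = rng.normal(size=(3650, 120))   # 10 years x 120 grid features

# Standardize the features, then cluster into a hypothetical number of weather patterns.
scaled = StandardScaler().fit_transform(daily_features)
kmeans = KMeans(n_clusters=12, n_init=10, random_state=0).fit(scaled)

# kmeans.cluster_centers_ are the pre-computed weather patterns to be stored;
# weather matching then amounts to kmeans.predict() on the current feature vector.
current_day = scaled[-1].reshape(1, -1)
print("closest pattern:", int(kmeans.predict(current_day)[0]))
```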


64 Deployment

Table 10 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 10 Components needed to deploy the Second SC5 Pilot

Module Task Responsible

HDFS Sextant Postgres BDI dockers made available by WP4 TF UoA NCSR-D

Scikit-learn TensorFlow To be developed in the pilot NCSR-D

DIPCOT To be packaged in the pilot NCSR-D

Weather clustering algorithm To be developed in the pilot NCSR-D

Weather matching To be developed in the pilot NCSR-D

Dispersion matching To be developed in the pilot NCSR-D

ECMWF and NOAA data

connector

To be developed in the pilot NCSR-D

Data visualization UI To be developed in the pilot NCSR-D


7 Second SC6 Pilot Deployment

71 Use cases

The pilot is carried out by NCSR-D and SWC in the frame of SC6 Europe in a changing world

- inclusive innovative and reflective societies

The pilot demonstrates the following workflow. Municipality economic data (i.e. budget and budget execution data) are ingested on a regular basis (daily, weekly, and so on) from a series of locations in a variety of structures and formats, and are homogenized so that they can be compared, analyzed, and visualized in a comprehensible way. The data is exposed to users via a dashboard that exposes search/discovery, aggregation, analysis, correlation, and visualization functionalities over structured data. The results of the data analysis will be stored in the infrastructure to avoid carrying out the same processing multiple times

The second cycle of the pilot will extend the first pilot by incorporating different formats by

developing a modular parsing library

The following datasets are involved

Budget execution data of Municipality of Athens

Budget execution data of Municipality of Thessaloniki

Budget execution data of Municipality of Barcelona

The current datasets involved are exposed either as an API or as CSV/XML files

Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies

Statistical data will be described in the RDF DataCube12 vocabulary

The following processing is carried out

Data ingestion and homogenization

Aggregation analysis correlation over scientific data

The following outputs are made available for visualization or further processing

Structured information extracted from budget datasets exposed as a SPARQL endpoint

Metadata for dataset searching and discovery

9 Cf. https://joinup.ec.europa.eu/asset/dcat_application_profile/description 10 Cf. http://www.omg.org/spec/EDMC-FIBO/FND/1.0/Beta1/index.htm 11 Cf. http://www.omg.org/hot-topics/finance.htm 12 Cf. https://www.w3.org/TR/2014/REC-vocab-data-cube-20140116


Aggregation and analysis

72 Requirements

Table 11 lists the ingestion storage processing and output requirements set by this pilot

Table 11 Requirements of the Second SC6 Pilot

Requirement Comment

R1 In case of failures during data

processing users should be able to

recover data produced earlier in the

workflow along with their lineage and

associated metadata

Processing modules should periodically

store intermediate results When starting

up processing modules should check at the

metadata registry if intermediate results are

available

R2 Transform budget data into a

homogenized format using various

parsers

Parsers will be developed for the pilot

taking into account R1

R3 Expose data and metadata through a

SPARQL endpoint

The triple store should be accessed via a

SPARQL endpoint

R4 Intuitive easy-to-use interface for

searching and selecting relevant data

sources The use of the user interface

should be documented so that users

can ease into using it with as little

effort as possible

The GraphSearch UI will be used to create

visualizations from SPARQL queries


Figure 6 Architecture of the Second SC6 Pilot


73 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing ingested datasets

4store for storing homogenized statistical data and dataset metadata

Processing infrastructures

Metadata extraction Spark is used to extract RDF data and metadata from budget

data These tools will react on Kafka messages

PoolParty: A SKOS Thesaurus13 will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g. bio terms, locations, and other named entities)

For this step the SWC PoolParty Semantic Suite14 will be used as an external service

13 Please cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System 14 Please cf. http://www.poolparty.biz


PoolParty is accessible from the BDE components via an HTTP API The connection

between Spark and PoolParty has been implemented in the first pilot cycle Additional

enrichment of the dataset will be explored eg via linking to DBpedia or other LOD

sources

Data analysis that will be performed on demand by pre-defined queries in the

dashboard

Other modules

Flume for dataset ingestion For every source that will be ingested into the system there

will be a flume agent responsible for data ingestion and basic modificationunification

Kafka as soon as a new record is available a Kafka message will be produced One

kafka consumer stores raw data into HDFS

A set of pre-defined SPARQL queries that carry out analytical aggregations, important comparisons, and/or other analyses of the data (an example query is sketched after this list)

GUI that provides functionality for (a) metadata searching to discover datasets, data, and publications, and (b) linked data browsing (i.e. dereferencing entity descriptions in RDF) in the form of a visual dashboard realised in d3js15

GraphSearch as the user interface
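The following sketch shows what one such pre-defined aggregation query could look like, summing budget execution amounts per municipality and year over RDF DataCube observations. The property URIs, the endpoint address, and the result handling are illustrative assumptions, not the pilot's actual vocabulary or schema.

```python
import requests

SPARQL_ENDPOINT = "http://4store:8080/sparql/"   # hypothetical 4store endpoint

# Illustrative aggregation over RDF DataCube observations; the actual dimension and
# measure URIs are defined by the pilot's data storage schema.
TOTAL_PER_YEAR = """
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX ex: <http://example.org/budget#>
SELECT ?municipality ?year (SUM(?amount) AS ?total)
WHERE {
    ?obs a qb:Observation ;
         ex:municipality ?municipality ;
         ex:year ?year ;
         ex:executedAmount ?amount .
}
GROUP BY ?municipality ?year
ORDER BY ?municipality ?year
"""

def run_predefined_query(query: str) -> list:
    """Execute a pre-defined SPARQL query and return the JSON result bindings."""
    response = requests.post(
        SPARQL_ENDPOINT,
        data={"query": query},
        headers={"Accept": "application/sparql-results+json"},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["results"]["bindings"]

if __name__ == "__main__":
    for row in run_predefined_query(TOTAL_PER_YEAR):
        print(row["municipality"]["value"], row["year"]["value"], row["total"]["value"])
```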

74 Deployment

Table 12 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 12 Components needed to deploy the Second SC6 Pilot

Module Task Responsible

Spark over HDFS 4store

Flume Kafka

BDI dockers made available by WP4 FH TF InfAI

NCSR-D SWC

Data storage schema To be extended for the pilot SWC

Metadata extraction Parsers for different data sources will

be developed for the pilot

SWC

15 Cf. https://d3js.org


GraphSearch GUI To be configured for the pilot SWC


8 Second SC7 Pilot Deployment

81 Use cases

The pilot is carried out by SatCen, UoA, and NCSR-D in the frame of SC7 Secure societies – Protecting freedom and security of Europe and its citizens

The pilot demonstrates the following workflows

1 Event detection workflow: News sites and social media are monitored and processed in order to extract and localize information about events. Events are categorized and the information from them is extracted; the end-user is notified about the area concerned by the news and can visualize the event information together with the changes detected by the other workflow (if activated)

2 Change detection workflow The end user selects a relevant Area of Interest With

respect to the selected dates two satellite images (earliest and latest) of these areas

are downloaded from ESA Sentinels Scientific Data Hub and processed in order to

detect changes The end-user is notified about detected changes and can view the

images and event information about this area

The second cycle of the SC7 pilot will extend the functionality and improve the performance of

the first cycle of the pilot (cf D52 Section 8)

Apart from the datasets used in the first cycle of the pilot this cycle will also use the keyword-

based Twitter API to retrieve tweets based on pre-defined keywords To further support the

keyword-based search the second cycle of the pilot will also include a full-text indexing engine

The following outputs are made available for visualization or further processing

Relevant news related to specific keywords together with the corresponding Area of Interest

Detected changes

Moreover the event detection workflow will be extended in order to automatically activate the

change detection workflow These changes are depicted in the updated architecture diagram

in Figure 7


82 Requirements

Table 13 lists the ingestion storage processing and output requirements set by the second

cycle of the pilot Since the present pilot cycle is an extension of the first pilot the requirements

of the first pilot also apply Table 13 lists only the new requirements

Table 13 Requirements of the Second SC7 Pilot

Requirement Comment

R1 Monitor keyword-based text services

(Twitter) Text is retrieved and stored

together with provenance and any

metadata provided by the service

(notably location)

The NOMAD data connectors to Twitter

and Reuters will be adapted to access the

keyword search API of Twitter and store to

Cassandra

R2 Regularly execute event detection

using Spark over the most recent text

batch

Event detection is part of the ingestion process and adds annotations to the text data; it is not part of the distributed processing

R3 Improve the speed of the change

detection workflow

Optimize the scalability of the operators

developed in Apache Spark for the change

detection workflow

R4 Extend change detection workflow to

improve accuracy

Fundamental SNAP operators (eg Subset

and Terrain Correction) for Sentinel 1 will be

adapted to Apache Spark

R5 Areas of Interest are automatically

defined by event detection

The Sentinel data connector is

parametrized from the event detection

module with a GIS shape

R6 End-user interface is based on Sextant Improvement of Sextant functionalities to

improve the user experience


R7 Users must be authenticated and

authorized to access the pilot data

Sextant will be extended in order to support

authentication and authorization


Figure 7 Architecture of the Second SC7 Pilot


83 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing satellite images

Cassandra for storing news and tweets content and metadata

Lucene for storing the GADM dataset, i.e. the administrative areas together with their geo-locations


Strabon for storing geo-locations of detected changes and location metadata about

news and tweets

Processing infrastructures

Spark will be made available for improving the change detection module and

developing the event detection module
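As a rough illustration of the change detection idea, the sketch below computes a log-ratio change mask between two co-registered Sentinel-1 intensity arrays. It is only an assumption-laden stand-in for the actual pilot processing, which adapts SNAP operators to Spark; the threshold and the synthetic input arrays are illustrative.

```python
import numpy as np

def change_mask(before: np.ndarray, after: np.ndarray, threshold_db: float = 3.0) -> np.ndarray:
    """Return a boolean mask of pixels whose backscatter changed by more than threshold_db.

    `before` and `after` are assumed to be co-registered Sentinel-1 intensity
    images (linear scale); the log-ratio is a common change indicator for SAR data.
    """
    eps = 1e-6
    log_ratio = 10.0 * np.abs(np.log10((after + eps) / (before + eps)))
    return log_ratio > threshold_db

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    before = rng.gamma(shape=2.0, scale=0.5, size=(512, 512))   # stand-in for a real scene
    after = before.copy()
    after[200:260, 300:360] *= 5.0                              # synthetic change region
    mask = change_mask(before, after)
    print("changed pixels:", int(mask.sum()))
```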

Data integration

Semagrow will federate Strabon and Cassandra to provide the user interface with

homogeneous access to both data stores

Other modules

Twitter data connector (see the sketch after this list)

Reuters RSS feed reader

The Sentinel Data Aggregator receives as input the set of areas of interest and submits

a suitable query to the Sentinels Scientific Data Hub

Sextant as the user interface
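A minimal sketch of the keyword-based Twitter connector is given below, retrieving recent tweets for pre-defined keywords and storing them with provenance metadata in Cassandra. It assumes a recent tweepy client for the Twitter search API and the DataStax cassandra-driver; the credentials, keyspace, table, and keywords are placeholders, not the actual pilot configuration.

```python
import tweepy                              # Twitter search API client (assumed available)
from cassandra.cluster import Cluster      # DataStax cassandra-driver

# Illustrative configuration: keywords, credentials, and Cassandra schema are placeholders.
BEARER_TOKEN = "REPLACE_ME"
KEYWORDS = ["explosion", "flood", "earthquake"]

twitter = tweepy.Client(bearer_token=BEARER_TOKEN)
session = Cluster(["cassandra"]).connect("sc7")   # assumed keyspace "sc7"

INSERT_TWEET = """
INSERT INTO tweets_by_keyword (keyword, tweet_id, created_at, text, source)
VALUES (%s, %s, %s, %s, %s)
"""

for keyword in KEYWORDS:
    # Retrieve recent tweets for the keyword, keeping provenance (source, timestamp).
    result = twitter.search_recent_tweets(
        query=keyword, tweet_fields=["created_at"], max_results=100
    )
    for tweet in result.data or []:
        session.execute(
            INSERT_TWEET,
            (keyword, tweet.id, tweet.created_at, tweet.text, "twitter-search-api"),
        )
```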

84 Deployment

Table 14 lists the components provided to the pilot as part of BDI16 and components that will

be developed within WP6 in the context of executing the pilot

Table 14 Components needed to deploy the Second SC7 Pilot

Module Task Responsible

Big Data Integrator

HDFSHadoop Cassandra

Spark Semagrow Strabon

SOLR

BDI dockers made available by WP4 FH TF InfAI

NCSR-D UoA

SwC

Cassandra and Strabon

stores

The schema needs to be altered to

support tweets by keyword

NCSR-D and

UoA

Change detection module Spark code to be developed for UoA

extending and improving the change detection algorithm
16 Cf. https://github.com/big-data-europe/README/wiki/Components

Event Detection module Spark code to be developed to scale

the event detection algorithm

NCSR-D

Twitter data connector To be extended to access the keyword

search Twitter API

NCSR-D

User interface To be enhanced for the pilot UoA


9 Conclusions

This report analysed the pilot requirements and specified the components of the generic Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting round. The relevant work in this task is to ensure that the components are within the scope of what is prepared in WP4, and that they interoperate and can be used in the same application.

All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure and provided to the piloting partners as a basis for their piloting applications, which will be developed in WP6. As a result of this preliminary testing and the interaction between the technical partners and the piloting partners, some of the original pilot descriptions have been refined and fully specified, and their usage of BDI components has been clarified. This ensures that the pilot descriptions are consistent with the first public release of the BDI platform (D42) and can be reproduced by interested third parties.

Work in this task (Task 52) will proceed as follows

During the second pilot deployment phase work in this task will follow and document

development of the individual components and test their integration into the platform

During the third pilot deployment phase work in this task will prepare the next version

of this document regarding the BDI instances needed for the third piloting round




Prepare a first draft of the sections for the second cycle pilots where use cases and

workflow from the pilot descriptions are summarized and technical requirements and

an architecture for each pilot-specific platform is drafted

Cooperate with the persons responsible for each pilot to update the pilot description

and the technical description in this deliverable so that they are consistent and

satisfactory This draft also includes a list of components and their availability (a) base

platform components that are prepared in WP4 (b) pilot-specific components that are

already available or (c) pilot-specific components that will be developed for the pilot

Components are also assigned a partner responsible for their implementation

Review the pilot technical descriptions from the perspective of bridging between

technical work and the community requirements to establish that the pilot is relevant

to the communities it is aimed at

During deployment phase work in this task will follow and document development of the

individual components and test their integration into the platform

D54 ndash v 100

Page

10

2 Second SC1 Pilot Deployment

21 Use Cases

The pilot is carried out by OPF and VU in the frame of SC1 Health Demographic Change and

Wellbeing

The pilot demonstrates the workflow of reproducing the functionality of an existing data

integration and processing system (the Open PHACTS Discovery Platform) on BDI The

second pilot extends the first pilot (cf D52 Section 2) with the following

Discussions with stakeholders and other Societal Challenges will identify how the

existing Open PHACTS platform and datasets may potentially be used to answer

queries in other domains In particular applications in Societal Challenge 2 (food

security and sustainable agriculture) where the effects of chemistry (eg pesticides)

on biology are probed in plants could exploit the linked data services currently within

the OPF platform This will require discussing use case specifics with SC2 to

understand their requirements and ensure that the OPF data is applicable Similarly

we will explore whether SC2 data could be linked to the OPF data platform is relevant

for early biology research

No specific new datasets are targeted for integration in the second pilot However if

datasets to be made available through other pilots have clear potential links to Open

PHACTS datasets these will be considered for integration into the platform to offer

researchers the ability to pose more complex queries across a wider range of data

The second pilot will aim to expand on first pilot by refreshing the datasets integrated

into the pilot Homogenising and integrating the new data available for these datasets

and developing ways to update datasets by integrating new data on an ongoing basis

will enable new use cases where researchers require fully current datasets for their

queries

The second pilot will also simplify existing workflows for querying the API for example

with components for common software tools such as KNIME reducing the barrier for

academic institutions and companies to access the platform for knowledge- and data-

driven biomedical research use cases

22 Requirements

Table 1 lists the ingestion storage processing and output requirements set by this pilot

Table 1 Requirements of the Second SC1 Pilot

D54 ndash v 100

Page

11

Requirement Comment

R1 The solution should be

packaged in a way such that it is

possible to combine the Open

PHACTS Docker and the BDE

platform to achieve a custom

integrated solution

Specificities of the services of the Open PHACTS

Discovery Platform should not be hard-wired into

the domain-specific instance but should be read

from a configuration file (such as SWAGGER)

The BDE instance should offer or apply these

external services over data hosted by the BDE

instance

R2 RDF data storage The current Open PHACTS Discovery Platform is

based on distributed Virtuoso a proprietary

solution The BDE platform will provide a

distributed 4store and SANSA to be compared

with the Open PHACTS Discovery Platform

R3 Datasets are aligned and linked

at data ingestion time and the

transformed data is stored

In conjunction with R1 a modular data ingestion

component should dynamically decide which data

transformers to invoke

R4 Data and query security and

privacy requirements

A BDI local deployment holds private data and

serves private queries BDE does not foresee any

specific technical support for query obfuscation

so remote data sources need to be cloned locally

to guarantee query privacy

Table 1 Requirements of the Second SC1 Pilot

D54 ndash v 100

Page

12

Figure 1 Architecture of the Second SC1 Pilot

Figure 1 Architecture of the Second SC1 pilot

23 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

Distributed triple store for the data The second pilot cycle will also test the feasibility of

using SANSA stack1 as an alternative of SPARQL query processing

Processing infrastructures

Scientific Lenses query expansion

Other modules

Data connector including the data transformation modules for the alignment of data at

ingestion time

REST API for querying that builds a SPARQL query by using keywords to fill in pre-

defined query templates The querying services also uses Scientific Lenses to expand

queries

24 Deployment

Table 2 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

1 httpsansa-stacknet

D54 ndash v 100

Page

13

Table 2 Components needed to Deploy Second SC1 Pilot

Module Task Responsible

4store BDI dockers made available by WP4 NCSR-D

SANSA stack BDI dockers made available by WP4 FhGUniBonn

Data connector and

transformation modules

Develop a dynamic transformation

engine that uses SWAGGER

descriptions to select the appropriate

transformer

VU

Query endpoint Develop a dynamic query re-write

engine that uses SWAGGER

descriptions to select the transformer

VU

Scientific Lenses query

expansion module

Needs to be deployed and tested

unless an existing live service will be

used for the BDE pilot

VU

Table 2 Components needed to Deploy Second SC1 Pilot

D54 ndash v 100

Page

14

3 Second SC2 Pilot Deployment

31 Overview

The pilot is carried out by AK FAO and SWC in the frame of SC2 Food Security Sustainable

Agriculture and Forestry Marine Maritime and Inland Water Research and the Bioeconomy

The second pilot cycle builds upon the first pilot cycle (cf D51 Section 3) expanding the

relevant data sources and extending the data processing needed to handle a variety of data

types (apart from bibliographic data) relevant to Viticulture

The pilot demonstrates the following workflows

1 Text mining workflow Automatically annotating scientific publications by (a) extracting

named entities (locations domain terms) and (b) extracting the captions of images

figures and tables The extracted information is provided to viticultural researchers via

a GUI that exposes search functionality

2 Data processing workflow The end users (viticultural researchers) upload scientific

data in a variety of formats and provide the metadata needed in order to correctly

interpret the data The data is ingested and homogenized so that it can be compared

and connected with other relevant data originally in diverse formats The data is

exposed to viticultural researchers via a GUI that exposes searchdiscovery

aggregation analysis correlation and visualization functionalities over structured data

The results of the data analysis will be stored in the infrastructure to avoid carrying out

the same processing multiple times with appropriate provence for future reference

publication and scientific replication

3 Phenologic modeling workflow that is the scheduling of agricultural operations (eg

pruning harvesting etc) by cross-examining the weather data observed in the area of

the vineyard with the appropriate weather conditions needed for the aforementioned

operations

4 Variety identification workflow The end users complete an on-spot questionnaire

regarding the characteristics of a specific grape variety Together with the geolocation

of the questionnaire this information is used to identify a grape variety

The following datasets will be involved

The AGRIS and PubMed datasets that include scientific publications

Weather data available via publicly-available API such as AccuWeather

OpenWeatherMap Weather Underground

D54 ndash v 100

Page

15

User-generated data such as geotagged photos from leaves young shoots and grape

clusters ampelographic data SSR-marker data that will be provided by the VITIS

application

OIV Descriptor List2 for Grape Varieties and Vitis species

Crop Ontology

The following processing is carried out

Named entity extraction

Researcher affiliation extraction and verification

Variety identification

Phenologic modelling

PDF structure processing to associate tables and diagrams with captions

Data ingestion and homogenization

Aggregation analysis correlation over scientific data

The following outputs are made available for visualization or further processing

Structured information topics extracted from scientific publications

Metadata for dataset searching and discovery

Aggregation analysis correlation results

32 Requirements

Table 3 lists the ingestion storage processing and output requirements set by this pilot

Table 3 Requirements of the Second SC2 Pilot

Requirement Comment

R1 In case of failures during data

processing users should be able to

recover data produced earlier in the

workflow along with their lineage and

associated metadata

Processing modules should periodically

store intermediate results and their lineage

metadata When starting up processing

modules should check at the metadata

registry if intermediate results are available

R2 Extracting images and their captions

from scientific publications

To be developed for the pilot taking into

account R1

2 httpwwwoivinten

D54 ndash v 100

Page

16

R3 Extracting thematic annotations from

text in scientific publications

To be developed for the pilot taking into

account R1

R4 Extracting researcher affiliations from

the scientific publications

To be developed for the pilot taking into

account R1

R5 Variety identification To be developed for the pilot taking into

account R1

R6 Phenolic modeling To be developed for the pilot taking into

account R1

R5 Expose data and metadata in JSON

through a Web API

Data ingestion module should write JSON

documents in HDFS 4store should be

accessed via a SPARQL endpoint that

responds with results in JSON

Table 3 Requirements of the Second SC2 Pilot

D54 ndash v 100

Page

17

Figure 2 Architecture of the Second SC2 Pilot

Figure 2 Architecture of the Second SC2 Pilot

33 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing publication full-text and ingested datasets

A graph database for storing publication metadata (terms and named entities)

affiliation metadata (connections between researchers) weather metadata and VITIS

metadata

Processing infrastructures:

- Metadata extraction: Spark or UnifiedViews (cf. http://www.unifiedviews.eu) are used to extract RDF metadata from publication full-text. These tools will react on Kafka messages; Spark and UnifiedViews will be evaluated for this task.
- PoolParty: A SKOS Thesaurus (cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System) will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g. bio terms, locations and other named entities). For this step the SWC PoolParty Semantic Suite (cf. http://www.poolparty.biz) will be used. Additional enrichment of the dataset will be explored, e.g. via linking to DBpedia or other LOD sources.
- AKSTEM: the process of discovering relations and associations between organizations and people in the field of viticulture research.
- Phenologic modelling: an algorithm already developed in AK VITIS will be adapted to work in the context of an Apache Spark application.
- Variety identification: already developed in AK VITIS, it will be adapted to work in the context of an Apache Spark application.
- Extraction of images and figures and their captions from publication PDFs.
- Data analysis, which writes analysis results back into the infrastructure to be retrieved for visualization. Data analysis should accompany each write-back with appropriate metadata that specify the processing lineage of the derived dataset. Intermediate results should also be written out (and described as such in the metadata) in order to allow resuming processing after a failure.
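As an illustration of the write-back and resume behaviour required by R1, the following minimal Python sketch stores each derived result next to a small lineage document and checks for it on start-up. It assumes a WebHDFS endpoint and the Python hdfs (hdfscli) package; all paths, field names and the registry layout are illustrative, not part of the pilot specification.

import json
from datetime import datetime, timezone
from hdfs import InsecureClient  # WebHDFS client (assumed available in the pilot image)

client = InsecureClient('http://namenode:50070', user='pilot')  # placeholder namenode address
RESULTS_DIR = '/sc2/intermediate'                               # illustrative registry location

def write_with_lineage(step, payload, inputs):
    """Store an intermediate result together with its lineage metadata (R1)."""
    lineage = {
        'step': step,
        'inputs': inputs,  # datasets or earlier steps this result was derived from
        'produced_at': datetime.now(timezone.utc).isoformat(),
    }
    client.write(f'{RESULTS_DIR}/{step}/data.json',
                 data=json.dumps(payload), encoding='utf-8', overwrite=True)
    client.write(f'{RESULTS_DIR}/{step}/lineage.json',
                 data=json.dumps(lineage), encoding='utf-8', overwrite=True)

def resume_from(step):
    """On start-up, check the registry for an earlier intermediate result."""
    if client.status(f'{RESULTS_DIR}/{step}/lineage.json', strict=False) is None:
        return None  # nothing to resume from; recompute from scratch
    with client.read(f'{RESULTS_DIR}/{step}/data.json', encoding='utf-8') as reader:
        return json.load(reader)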

Other modules:

- Flume for publication ingestion. For every source that will be ingested into the system, there will be a Flume agent responsible for data ingestion and basic modification/unification.
- Kafka: as soon as a new record is available, a Kafka message will be produced. One Kafka consumer stores raw data into HDFS.
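A minimal sketch of such a Kafka consumer that lands raw records in HDFS could look as follows; the broker address, topic name and target directory are placeholders for illustration (kafka-python and the hdfscli client are assumed).

from kafka import KafkaConsumer   # kafka-python
from hdfs import InsecureClient

consumer = KafkaConsumer('sc2-ingestion',                 # assumed topic name
                         bootstrap_servers='kafka:9092',
                         auto_offset_reset='earliest')
hdfs_client = InsecureClient('http://namenode:50070', user='pilot')

for message in consumer:
    # One file per raw record, keyed by partition and offset, in the landing zone.
    path = f'/sc2/raw/{message.partition}-{message.offset}.json'
    hdfs_client.write(path, data=message.value, overwrite=True)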

3.4 Deployment

Table 4 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.

Table 4: Components needed to deploy the Second SC2 Pilot

Module | Task | Responsible

Spark over HDFS, Flume, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, SWC


GraphDB and/or Neo4j dockerization | To be investigated whether the Docker images provided by the official systems (cf. https://neo4j.com/developer/docker) are suitable for the pilot. If not, they will be altered for the pilot, or an already dockerized triple store such as Virtuoso or 4store will be used. | SWC
Flume agents for publication ingestion and processing | To be developed for the pilot | SWC
Flume agents for data ingestion | To be extended for the pilot in order to support the introduced datasets (AccuWeather data, user-generated data) | SWC, AK
Data storage schema | To be developed for the pilot | SWC, AK
Phenologic modelling | To be adapted from AK VITIS for the pilot | AK
Spark AKSTEM | To be adapted from AK STEM for the pilot | AK
Variety identification | To be adapted from AK VITIS for the pilot | AK


4 Second SC3 Pilot Deployment

4.1 Overview

The pilot is carried out by CRES in the frame of SC3: Secure, Clean and Efficient Energy.

The second pilot cycle extends the first pilot by adding online and offline data analysis on raw data from Acoustic Emissions (AE) sensors and on aggregated data, such as parametrics from continuous monitoring systems (CMS). The pilot demonstrates the following workflow: a developer in the field of wind energy enhances condition monitoring for each unit in a wind farm by pooling together data from multiple units of the same farm (to consider the cluster operation in total) and third-party data (to perform correlated assessment). The custom analysis modules created by the developer use both raw data that are transferred offline to the processing cluster and condensed data streamed online, in the same time order in which the events occur.

The following datasets are involved:

- Raw sensor and SCADA data from a given wind farm
- Online stream data comprised of parametrics and statistics extracted from the raw SCADA data
- Raw sensor data from the Acoustic Emissions module of a given wind farm

All data is in custom binary or ASCII formats. ASCII files contain a metadata header and, in tabulated form, the signal data (signals in columns, time sequence in rows). All data is annotated by location, time and system id.

The following processing is carried out:

- Near-real-time execution of parametrized models to return operational statistics and warnings, including correlation analysis of data across units
- Weekly execution of operational statistics
- Weekly execution of model parametrization
- Weekly execution of specific acoustic emissions DSP

The following outputs are made available for visualization or further processing:

- Operational statistics, near-real-time and weekly
- Model parameters


4.2 Requirements

Table 5 lists the ingestion, storage, processing and output requirements set by this pilot. Since the second cycle of the pilot extends the first pilot, some requirements are identical and are therefore omitted from Table 5.

Table 5: Requirements of the Second SC3 Pilot

R1: The online data will be sent (via OPC) from the intermediate (local) processing level to BDI.
Comment: A data connector must be developed that provides for receiving OPC streams from an OPC-compatible server.

R2: The application should be able to recover from short outages by collecting the data transmitted during the outage from the data sources.
Comment: An OPC data connector must be developed that can retrieve the missing data collected at the intermediate level from the distributed data historian systems.

R3: Near-real-time execution of parametrized models to return operational statistics, including correlation analysis of data across units.
Comment: The analysis software should write its results back in a specified format and data model that is appropriate input for further analysis.

R4: The GUI supports database querying and data visualization for the analytics results.
Comment: The GUI will be able to access files in that format and data model.

Figure 3: Architecture of the Second SC3 Pilot

4.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures:

- HDFS, which stores binary blobs, each holding a temporal slice of the complete data. The slicing parameters are fixed and can be applied at data ingestion time.
- A Postgres relational database to store the warnings, operational statistics and the output of the analysis. The schema will be defined at a later stage.
- A Kafka broker that will distribute the continuous stream of CMS data to model execution.

Processing infrastructures:

- A processor that operates upon temporal slices of data
- A Spark module that orchestrates the application of the processor on slices
- A Spark streaming module that operates on the online data
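As a sketch of the Spark streaming module, the following PySpark Structured Streaming job consumes the condensed CMS stream from Kafka and computes windowed operational statistics per unit; the topic name, broker address and record schema are assumptions for illustration only.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("cms-online-stats").getOrCreate()

# Illustrative schema for the condensed CMS parametrics; field names are assumptions.
schema = StructType([
    StructField("unit_id", StringType()),
    StructField("parameter", StringType()),
    StructField("value", DoubleType()),
    StructField("event_time", TimestampType()),
])

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")
          .option("subscribe", "cms-parametrics")          # assumed topic name
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
          .select("r.*"))

# Near-real-time operational statistics per unit and parameter over 10-minute windows.
stats = (stream
         .withWatermark("event_time", "20 minutes")
         .groupBy(F.window("event_time", "10 minutes"), "unit_id", "parameter")
         .agg(F.avg("value").alias("mean"), F.stddev("value").alias("stddev")))

query = stats.writeStream.outputMode("append").format("console").start()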

Other modules:

- A data connector that offers an ingestion endpoint and/or can retrieve from remote data sources using the FTP protocol
- A data connector that offers an ingestion endpoint that can retrieve an online stream using the OPC protocol and publish it to a Kafka topic (a sketch follows after this list)
- Data visualization that can visualize the data files stored in HDFS
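A sketch of the OPC data connector could poll an OPC UA server and republish readings on the Kafka topic consumed by the streaming module. The example below uses the python-opcua client and kafka-python; the server URL, node ids, topic name and polling interval are all placeholders, and the production connector would rather use subscriptions and the historian-backed recovery described in R2.

import json, time
from opcua import Client                      # python-opcua (assumed available)
from kafka import KafkaProducer

OPC_URL = "opc.tcp://cms-server:4840"         # placeholder OPC endpoint
NODE_IDS = ["ns=2;i=2", "ns=2;i=3"]           # placeholder node ids for CMS parametrics

producer = KafkaProducer(bootstrap_servers="kafka:9092",
                         value_serializer=lambda d: json.dumps(d).encode("utf-8"))
opc = Client(OPC_URL)
opc.connect()
try:
    while True:
        for node_id in NODE_IDS:
            value = opc.get_node(node_id).get_value()
            # Publish each reading to the Kafka topic consumed by the streaming module.
            producer.send("cms-parametrics", {"node": node_id,
                                              "value": value,
                                              "ts": time.time()})
        producer.flush()
        time.sleep(10)                        # polling interval; a subscription could be used instead
finally:
    opc.disconnect()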

4.4 Deployment

Table 6 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.

Table 6: Components needed to deploy the Second SC3 Pilot

Module | Task | Responsible

Spark, HDFS, Postgres, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, SWC
Acoustic Emissions DSP | To be developed for the pilot | CRES
OPC data connector | To be developed for the pilot | CRES
Data visualization | To be extended for the pilot | CRES


5 Second SC4 Pilot Deployment

5.1 Use cases

The pilot is carried out by FhG and CERTH in the frame of SC4: Smart, Green and Integrated Transport.

The pilot demonstrates how to implement the workflow for ingesting, processing and storing stream and historical traffic data in a distributed environment. The pilot demonstrates the following workflows:

- The map matching of the Floating Car Data (FCD) stream that is generated by the taxi fleet. The FCD data, which represents the position of cabs using latitude and longitude coordinates, must be map matched to the roads on which the cabs are driving, in order to infer the traffic conditions of the roads. The map matching is done through an algorithm using a geographical database and topological rules.
- The monitoring of the current traffic conditions, which consumes the mapped FCD data and infers the traffic conditions of the roads.
- The forecasting of future traffic conditions, based on a model that is trained from historical and real-time mapped FCD data.

The second pilot is based upon the processing modules developed in the first pilot (cf. D5.2, Section 5), namely the processing modules developed by CERTH to analyze traffic data and classify traffic conditions. The second pilot will also develop the newly added workflow of traffic forecasting and model training, which did not exist during the first pilot cycle.

The data sources available for the pilot are:

- A near-real-time stream of Floating Car Data (FCD) generated by a fleet of 1200 taxis, containing information about the position, speed and direction of the cabs
- A historical database of recorded FCD data
- A geographical database with information about the road network in Thessaloniki

The results of traffic monitoring and traffic forecasting are saved into a database for querying, statistics and visualizations.

5.2 Requirements

Table 7 lists the ingestion, storage, processing and output requirements set by this pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 7 lists only the new requirements.


Table 7: Requirements of the Second SC4 Pilot

R1: The pilot will enable the evaluation of the present and future traffic conditions (e.g. congestion) within temporal windows.
Comment: The FCD map-matched data are used to determine the current traffic condition and to make predictions within different time windows.

R2: The traffic predictions will be saved in a database.
Comment: Traffic condition and prediction will be used for queries, statistics, evaluation of the quality of predictions, and visualizations.

R3: The pilot can be started in two configurations: single node (for development and testing) and cluster (production).
Comment: It must be possible to run all the pilot components in one single node for development and testing purposes. The cluster configuration must provide clustering of all components: messaging system (Kafka), processing modules (Flink, Spark, TensorFlow), storage (Postgres).

Figure 4: Architecture of the Second SC4 Pilot

5.3 Architecture

The architecture of the pilot has been designed taking into consideration the data sources (mostly streams), the processing steps needed, and the information that needs to be computed. The pilot will ingest data from a near-real-time FCD data stream from cabs and from historical FCD data. The FCD data needs to be preprocessed for map matching before being used for classification/prediction.

Apache Kafka will be used to distribute the computations, as it provides a scalable, fault-tolerant messaging system. The processing of the data streams will be performed within temporal windows. Apache Flink will be used for the map matching algorithm, in the same manner as in the first cycle of the pilot. Apache Spark or TensorFlow will be considered as a platform to implement the traffic forecasting algorithm.

The algorithms used for the map matching and classification will be provided using R, as it provides good support for machine learning algorithms and because it is commonly used and well known by researchers at CERTH. In order to use the R packages in a Flink application developed in Java, the pilot will connect to an R server (via Rserve). Recurrent Neural Networks will be used for the traffic forecasting module.
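As an indication of the kind of recurrent model the forecasting module could use, the following Keras sketch trains a small LSTM to predict the next value of a windowed, map-matched speed series; the window length, feature set and the random placeholder data are illustrative only.

import numpy as np
import tensorflow as tf

# Illustrative shapes: 12 past 5-minute intervals of average link speed -> next interval.
WINDOW, FEATURES = 12, 1

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(WINDOW, FEATURES)),
    tf.keras.layers.Dense(1),                 # predicted average speed for the next interval
])
model.compile(optimizer="adam", loss="mse")

# x: (samples, 12, 1) windows of historical map-matched speeds, y: (samples, 1) next value.
x = np.random.rand(1000, WINDOW, FEATURES).astype("float32")   # placeholder for real FCD history
y = np.random.rand(1000, 1).astype("float32")
model.fit(x, y, epochs=5, batch_size=64)

forecast = model.predict(x[:1])               # one-step-ahead traffic forecast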

The traffic conditions and prediction computations will be stored in a scalable, fault-tolerant database such as Elasticsearch. The storage system must support spatial and temporal indexing.
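The spatial and temporal indexing requirement translates directly into the index mapping. A minimal sketch with the Elasticsearch Python client (7.x-style calls; host, index name and fields are illustrative):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://elasticsearch:9200"])   # placeholder host

# Index mapping with spatial (geo_point) and temporal (date) fields for traffic records.
es.indices.create(index="traffic-predictions", body={
    "mappings": {
        "properties": {
            "link_id":    {"type": "keyword"},
            "location":   {"type": "geo_point"},
            "valid_from": {"type": "date"},
            "congestion": {"type": "float"},
        }
    }
})

es.index(index="traffic-predictions", body={
    "link_id": "link-42",
    "location": {"lat": 40.64, "lon": 22.94},        # Thessaloniki
    "valid_from": "2017-02-24T10:00:00Z",
    "congestion": 0.73,
})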

5.4 Deployment

Table 8 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.

Table 8: Components needed to deploy the Second SC4 Pilot

Module | Task | Responsible

PostGIS, Elasticsearch, Kafka, Flink, Spark, TensorFlow | BDI dockers made available by WP4 | NCSR-D, SWC, TF, FhG
A Kafka producer for the FCD data stream (source: URL) and historical data (source: file system) | Develop a Kafka producer to collect the FCD data as a stream from web services and from the file system for the historical data sets, and send them to a Kafka topic | FhG
Kafka brokers | Install Kafka to provide a message broker and the topics | SWC
A Spark application for traffic forecasting and model training | Develop a Spark application that consumes FCD map-matched data from a Kafka topic. The application will train a prediction model and write the traffic predictions to Elasticsearch | FhG


A Kafka consumer for storing analysis results | Develop a Kafka consumer that stores the results of the traffic classification and prediction modules | FhG


6 Second SC5 Pilot Deployment

6.1 Use cases

The pilot is carried out by NCSR-D in the frame of SC5: Climate Action, Environment, Resource Efficiency and Raw Materials.

The pilot demonstrates the following workflow. A (potentially hazardous) substance is released in the atmosphere, which results in increased readings at one or more monitoring stations. The user accesses a user interface provided by the pilot to define the locations of the monitoring stations as well as a timeseries of the measured values (e.g. gamma dose rate). The platform initiates:

- a weather matching algorithm, that is, a search for similarity of the current weather with the pre-computed weather patterns, as well as
- a dispersion matching algorithm, that is, a search for similarity of the current substance dispersion patterns with the precomputed ones.

The weather patterns have been extracted in a pre-processing step by clustering weather conditions recorded in the past, while the substance dispersion patterns have been precomputed by simulating different scenarios of substance release and weather conditions. The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request.

The following datasets are involved:

- NetCDF files from the European Centre for Medium-range Weather Forecasting (ECMWF, http://apps.ecmwf.int/datasets)
- GRIB files from the National Oceanic and Atmospheric Administration (NOAA, https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/global-forcast-system-gfs)

The following processing will be carried out:

- The weather clustering algorithm that creates clusters of similar weather conditions, implemented using the BDI platform (see Section 6.3)


- The WRF downscaling that takes as input a low-resolution weather field and creates a high-resolution one
- The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model, which computes dispersion patterns given predominant weather conditions

The following outputs are made available for visualization or further processing:

- The dispersions produced by DIPCOT
- The weather clusters produced by the weather clustering algorithm

6.2 Requirements

Table 9 lists the ingestion, storage, processing and output requirements set by this pilot.

Table 9: Requirements of the Second SC5 Pilot

R1: Provide a means of downloading current/evaluation weather from ECMWF or alternative services.
Comment: A data connector/interface needs to be developed.

R2: ECMWF and NOAA datasets are compatible with the WRF and DIPCOT naming conventions.
Comment: A preprocessing WPS normalization step will perform the necessary transformations and variable renamings needed to ensure compatibility.

R3: Retrieve NetCDF files from HDFS as input to the weather clustering algorithm.

R4: Dispersion matching will filter on dispersion values.
Comment: A relational database will provide indexes on dispersion values for efficient dispersion search.

R5: Dispersion visualization.
Comment: Weather and dispersion matching must produce output compatible with Sextant's input, or Sextant must be modified to support the new input.

Figure 5: Architecture of the Second SC5 Pilot

6.3 Architecture

To satisfy the requirements described above, the following components will be deployed.

Storage infrastructure:

- HDFS for storing NetCDF and GRIB files
- Postgres for storing dispersions

Processing components:

- Scikit-learn or TensorFlow to host the weather clustering algorithm (a clustering sketch follows after this list)

Other modules:

- ECMWF and NOAA data connectors
- WPS normalization procedure
- WRF downscaling component
- DIPCOT atmospheric dispersion model
- Weather and dispersion matching
- Sextant for visualizing the dispersion layer
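A minimal sketch of the weather clustering step, assuming one ECMWF NetCDF file read with the netCDF4 package and clustered with scikit-learn; the file name, variable name and number of clusters are illustrative, not the pilot's actual configuration.

import numpy as np
from netCDF4 import Dataset
from sklearn.cluster import KMeans

# Illustrative: one NetCDF file with a weather field laid out as (time, lat, lon).
ds = Dataset("ecmwf_sample.nc")                 # placeholder file name
field = np.asarray(ds.variables["t2m"][:])      # assumed variable name (2 m temperature)
n_times = field.shape[0]

# Each time step becomes one sample: flatten the spatial grid into a feature vector.
samples = field.reshape(n_times, -1)

# Cluster past weather situations into a small number of representative patterns.
kmeans = KMeans(n_clusters=8, random_state=0).fit(samples)
patterns = kmeans.cluster_centers_              # pre-computed weather patterns

# Matching the "current" weather then amounts to finding the nearest pattern.
current = samples[-1].reshape(1, -1)
best_match = int(kmeans.predict(current)[0])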


6.4 Deployment

Table 10 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.

Table 10: Components needed to deploy the Second SC5 Pilot

Module | Task | Responsible

HDFS, Sextant, Postgres | BDI dockers made available by WP4 | TF, UoA, NCSR-D
Scikit-learn, TensorFlow | To be developed in the pilot | NCSR-D
DIPCOT | To be packaged in the pilot | NCSR-D
Weather clustering algorithm | To be developed in the pilot | NCSR-D
Weather matching | To be developed in the pilot | NCSR-D
Dispersion matching | To be developed in the pilot | NCSR-D
ECMWF and NOAA data connector | To be developed in the pilot | NCSR-D
Data visualization UI | To be developed in the pilot | NCSR-D


7 Second SC6 Pilot Deployment

7.1 Use cases

The pilot is carried out by NCSR-D and SWC in the frame of SC6: Europe in a changing world – inclusive, innovative and reflective societies.

The pilot demonstrates the following workflow. Municipality economic data (i.e. budget and budget execution data) are ingested on a regular basis (daily, weekly, and so on) from a series of locations, in a variety of structures and formats, and are homogenized so that they can be compared, analyzed and visualized in a comprehensible way. The data is exposed to users via a dashboard that exposes search/discovery, aggregation, analysis, correlation and visualization functionalities over structured data. The results of the data analysis will be stored in the infrastructure to avoid carrying out the same processing multiple times.

The second cycle of the pilot will extend the first pilot by incorporating different formats and by developing a modular parsing library.

The following datasets are involved:

- Budget execution data of the Municipality of Athens
- Budget execution data of the Municipality of Thessaloniki
- Budget execution data of the Municipality of Barcelona

The current datasets involved are exposed either via an API or as CSV/XML files.

Datasets will be described by DCAT-AP metadata and the FIBO and FIGI ontologies; statistical data will be described in the RDF Data Cube vocabulary. (DCAT-AP: https://joinup.ec.europa.eu/asset/dcat_application_profile/description; FIBO: http://www.omg.org/spec/EDMC-FIBO/FND/1.0/Beta1/index.htm; FIGI: http://www.omg.org/hot-topics/finance.htm; RDF Data Cube: https://www.w3.org/TR/2014/REC-vocab-data-cube-20140116/)
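For illustration, a single budget execution figure expressed as an RDF Data Cube observation with rdflib could look as follows; the namespace, dataset URI and property names are placeholders, not the pilot's actual schema.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

QB = Namespace("http://purl.org/linked-data/cube#")
EX = Namespace("http://example.org/budget/")        # placeholder namespace for the pilot data

g = Graph()
g.bind("qb", QB)

obs = EX["obs/athens-2016-expenditure"]
g.add((obs, RDF.type, QB.Observation))
g.add((obs, QB.dataSet, EX["dataset/athens-budget-execution"]))
g.add((obs, EX.municipality, Literal("Athens")))
g.add((obs, EX.fiscalYear, Literal("2016", datatype=XSD.gYear)))
g.add((obs, EX.executedAmount, Literal("1234567.89", datatype=XSD.decimal)))

print(g.serialize(format="turtle"))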

The following processing is carried out:

- Data ingestion and homogenization
- Aggregation, analysis and correlation over the budget data

The following outputs are made available for visualization or further processing:

- Structured information extracted from budget datasets, exposed as a SPARQL endpoint
- Metadata for dataset searching and discovery
- Aggregation and analysis results

7.2 Requirements

Table 11 lists the ingestion, storage, processing and output requirements set by this pilot.

Table 11: Requirements of the Second SC6 Pilot

R1: In case of failures during data processing, users should be able to recover data produced earlier in the workflow, along with their lineage and associated metadata.
Comment: Processing modules should periodically store intermediate results. When starting up, processing modules should check at the metadata registry whether intermediate results are available.

R2: Transform budget data into a homogenized format using various parsers.
Comment: Parsers will be developed for the pilot, taking into account R1.

R3: Expose data and metadata through a SPARQL endpoint.
Comment: The triple store should be accessed via a SPARQL endpoint.

R4: Intuitive, easy-to-use interface for searching and selecting relevant data sources. The use of the user interface should be documented so that users can ease into using it with as little effort as possible.
Comment: The GraphSearch UI will be used to create visualizations from SPARQL queries.

Figure 6: Architecture of the Second SC6 Pilot

7.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures:

- HDFS for storing ingested datasets
- 4store for storing homogenized statistical data and dataset metadata

Processing infrastructures:

- Metadata extraction: Spark is used to extract RDF data and metadata from the budget data. It will react on Kafka messages.
- PoolParty: A SKOS Thesaurus (cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System) will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g. bio terms, locations and other named entities). For this step the SWC PoolParty Semantic Suite (cf. http://www.poolparty.biz) will be used as an external service. PoolParty is accessible from the BDE components via an HTTP API; the connection between Spark and PoolParty has been implemented in the first pilot cycle. Additional enrichment of the dataset will be explored, e.g. via linking to DBpedia or other LOD sources.
- Data analysis, which will be performed on demand by pre-defined queries in the dashboard.

Other modules:

- Flume for dataset ingestion. For every source that will be ingested into the system, there will be a Flume agent responsible for data ingestion and basic modification/unification.
- Kafka: as soon as a new record is available, a Kafka message will be produced. One Kafka consumer stores raw data into HDFS.
- A set of pre-defined SPARQL queries that carry out analytical aggregations, important comparisons and/or other analysis of the data (an example query follows after this list).
- A GUI that provides functionality for (a) metadata searching to discover datasets, data and publications, and (b) linked data browsing (i.e. dereferencing entity descriptions in RDF), in the form of a visual dashboard realised in d3.js (https://d3js.org).
- GraphSearch as the user interface.
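One of the pre-defined analytical queries could, for example, aggregate executed amounts per municipality. A sketch with SPARQLWrapper, where the endpoint URL and property names are illustrative assumptions:

from SPARQLWrapper import SPARQLWrapper, JSON

# Placeholder endpoint exposed by the 4store deployment; property names are illustrative.
sparql = SPARQLWrapper("http://4store:8080/sparql/")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX qb: <http://purl.org/linked-data/cube#>
    PREFIX ex: <http://example.org/budget/>

    SELECT ?municipality (SUM(?amount) AS ?total)
    WHERE {
      ?obs a qb:Observation ;
           ex:municipality ?municipality ;
           ex:executedAmount ?amount .
    }
    GROUP BY ?municipality
""")
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["municipality"]["value"], row["total"]["value"])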

7.4 Deployment

Table 12 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.

Table 12: Components needed to deploy the Second SC6 Pilot

Module | Task | Responsible

Spark over HDFS, 4store, Flume, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, SWC
Data storage schema | To be extended for the pilot | SWC
Metadata extraction | Parsers for different data sources will be developed for the pilot | SWC


GraphSearch GUI | To be configured for the pilot | SWC


8 Second SC7 Pilot Deployment

8.1 Use cases

The pilot is carried out by SatCen, UoA and NCSR-D in the frame of SC7: Secure societies – protecting freedom and security of Europe and its citizens.

The pilot demonstrates the following workflows:

1. Event detection workflow: News sites and social media are monitored and processed in order to extract and localize information about events. Events are categorized and the information about them is extracted; the end-user is notified about the area affected by the news and can visualize the event information together with the changes detected by the other workflow (if activated).

2. Change detection workflow: The end user selects a relevant Area of Interest. With respect to the selected dates, two satellite images (earliest and latest) of these areas are downloaded from the ESA Sentinels Scientific Data Hub and processed in order to detect changes. The end-user is notified about detected changes and can view the images and event information about this area.

The second cycle of the SC7 pilot will extend the functionality and improve the performance of the first cycle of the pilot (cf. D5.2, Section 8).

Apart from the datasets used in the first cycle of the pilot, this cycle will also use the keyword-based Twitter API to retrieve tweets based on pre-defined keywords. To further support the keyword-based search, the second cycle of the pilot will also include a full-text indexing engine.

The following outputs are made available for visualization or further processing:

- Relevant news related to specific keywords, together with the corresponding Area of Interest
- Detected changes

Moreover, the event detection workflow will be extended in order to automatically activate the change detection workflow. These changes are depicted in the updated architecture diagram in Figure 7.


8.2 Requirements

Table 13 lists the ingestion, storage, processing and output requirements set by the second cycle of the pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 13 lists only the new requirements.

Table 13: Requirements of the Second SC7 Pilot

R1: Monitor keyword-based text services (Twitter). Text is retrieved and stored together with provenance and any metadata provided by the service (notably location).
Comment: The NOMAD data connectors to Twitter and Reuters will be adapted to access the keyword search API of Twitter and store to Cassandra.

R2: Regularly execute event detection using Spark over the most recent text batch.
Comment: Event detection is part of the ingestion process and adds annotations to the text data; it is not part of the distributed processing.

R3: Improve the speed of the change detection workflow.
Comment: Optimize the scalability of the operators developed in Apache Spark for the change detection workflow.

R4: Extend the change detection workflow to improve accuracy.
Comment: Fundamental SNAP operators (e.g. Subset and Terrain Correction) for Sentinel-1 will be adapted to Apache Spark.

R5: Areas of Interest are automatically defined by event detection.
Comment: The Sentinel data connector is parametrized from the event detection module with a GIS shape.

R6: The end-user interface is based on Sextant.
Comment: Improvement of Sextant functionalities to improve the user experience.

R7: Users must be authenticated and authorized to access the pilot data.
Comment: Sextant will be extended in order to support authentication and authorization.

Figure 7: Architecture of the Second SC7 Pilot

8.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures:

- HDFS for storing satellite images
- Cassandra for storing news and tweet content and metadata
- Lucene for storing the GADM dataset, i.e. the administrative areas together with their geo-locations
- Strabon for storing geo-locations of detected changes and location metadata about news and tweets

Processing infrastructures:

- Spark will be made available for improving the change detection module and developing the event detection module.
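A sketch of how change detection can be scaled out with Spark: pairs of co-registered image tiles are distributed over the cluster and a per-tile operator is applied. The operator below is a simple difference threshold standing in for the SNAP-based processing actually used in the pilot; the tile size, threshold and random placeholder data are illustrative only.

from pyspark.sql import SparkSession
import numpy as np

spark = SparkSession.builder.appName("sc7-change-detection").getOrCreate()
sc = spark.sparkContext

def detect_change(tile_pair):
    """Placeholder for the pilot's change detection operator on one pair of tiles.

    In the pilot this step wraps SNAP-based processing of Sentinel-1 tiles; here a
    simple difference threshold stands in for it."""
    tile_id, (earliest, latest) = tile_pair
    changed = np.abs(latest - earliest) > 0.2
    return tile_id, float(changed.mean())

# Illustrative input: tile pairs for the earliest and latest acquisition of the AoI.
tiles = [(i, (np.random.rand(256, 256), np.random.rand(256, 256))) for i in range(100)]

# Distribute the tile pairs over the cluster and collect the per-tile change fraction.
results = sc.parallelize(tiles).map(detect_change).collect()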

Data integration:

- Semagrow will federate Strabon and Cassandra to provide the user interface with homogeneous access to both data stores.

Other modules:

- Twitter data connector (a storage sketch follows after this list)
- Reuters RSS feed reader
- The Sentinel Data Aggregator, which receives as input the set of Areas of Interest and submits a suitable query to the Sentinels Scientific Data Hub
- Sextant as the user interface
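A sketch of the storage side of the Twitter connector (R1): tweets retrieved through the keyword search API are written to Cassandra together with their metadata. The keyspace, table layout and the already-fetched tweet records are placeholders; the actual connector adapts the existing NOMAD connectors.

from cassandra.cluster import Cluster

# Placeholder contact point, keyspace and table; the pilot's actual schema may differ.
session = Cluster(["cassandra"]).connect("sc7")
session.execute("""
    CREATE TABLE IF NOT EXISTS tweets (
        keyword text, id bigint, posted_at timestamp,
        text text, lat double, lon double,
        PRIMARY KEY (keyword, posted_at, id))
""")

insert = session.prepare(
    "INSERT INTO tweets (keyword, id, posted_at, text, lat, lon) VALUES (?, ?, ?, ?, ?, ?)")

def store_tweets(keyword, tweets):
    """Persist tweets (already fetched via the keyword search API) with their metadata."""
    for t in tweets:
        session.execute(insert, (keyword, t["id"], t["created_at"],
                                 t["text"], t.get("lat"), t.get("lon")))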

8.4 Deployment

Table 14 lists the components provided to the pilot as part of BDI (cf. https://github.com/big-data-europe/README/wiki/Components) and components that will be developed within WP6 in the context of executing the pilot.

Table 14: Components needed to deploy the Second SC7 Pilot

Module | Task | Responsible

Big Data Integrator: HDFS/Hadoop, Cassandra, Spark, Semagrow, Strabon, SOLR | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, UoA, SWC
Cassandra and Strabon stores | The schema needs to be altered to support tweets by keyword | NCSR-D and UoA
Change detection module | Spark code to be developed for extending and improving the change detection algorithm | UoA

Event detection module | Spark code to be developed to scale the event detection algorithm | NCSR-D
Twitter data connector | To be extended to access the keyword search Twitter API | NCSR-D
User interface | To be enhanced for the pilot | UoA


9 Conclusions

This report analyses the pilot requirements and specifies the components of the generic Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting round. The relevant work in this task is to ensure that the components are within the scope of what is prepared in WP4, and that they interoperate and can be used in the same application.

All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure and provided to the piloting partners as a basis for their piloting applications, which will be developed in WP6. As a result of this preliminary testing and of the interaction between the technical partners and the piloting partners, some of the original pilot descriptions have been refined and fully specified, and their usage of BDI components has been clarified. This ensures that the pilot descriptions are consistent with the first public release of the BDI platform (D4.2) and can be reproduced by interested third parties.

Work in this task (Task 5.2) will proceed as follows:

- During the second pilot deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.
- During the third pilot deployment phase, work in this task will prepare the next version of this document, regarding the BDI instances needed for the third piloting round.



of the questionnaire this information is used to identify a grape variety

The following datasets will be involved

The AGRIS and PubMed datasets that include scientific publications

Weather data available via publicly-available API such as AccuWeather

OpenWeatherMap Weather Underground

D54 ndash v 100

Page

15

User-generated data such as geotagged photos from leaves young shoots and grape

clusters ampelographic data SSR-marker data that will be provided by the VITIS

application

OIV Descriptor List2 for Grape Varieties and Vitis species

Crop Ontology

The following processing is carried out

Named entity extraction

Researcher affiliation extraction and verification

Variety identification

Phenologic modelling

PDF structure processing to associate tables and diagrams with captions

Data ingestion and homogenization

Aggregation analysis correlation over scientific data

The following outputs are made available for visualization or further processing

Structured information topics extracted from scientific publications

Metadata for dataset searching and discovery

Aggregation analysis correlation results

32 Requirements

Table 3 lists the ingestion storage processing and output requirements set by this pilot

Table 3 Requirements of the Second SC2 Pilot

Requirement Comment

R1 In case of failures during data

processing users should be able to

recover data produced earlier in the

workflow along with their lineage and

associated metadata

Processing modules should periodically

store intermediate results and their lineage

metadata When starting up processing

modules should check at the metadata

registry if intermediate results are available

R2 Extracting images and their captions

from scientific publications

To be developed for the pilot taking into

account R1

2 httpwwwoivinten

D54 ndash v 100

Page

16

R3 Extracting thematic annotations from

text in scientific publications

To be developed for the pilot taking into

account R1

R4 Extracting researcher affiliations from

the scientific publications

To be developed for the pilot taking into

account R1

R5 Variety identification To be developed for the pilot taking into

account R1

R6 Phenolic modeling To be developed for the pilot taking into

account R1

R5 Expose data and metadata in JSON

through a Web API

Data ingestion module should write JSON

documents in HDFS 4store should be

accessed via a SPARQL endpoint that

responds with results in JSON

Table 3 Requirements of the Second SC2 Pilot

D54 ndash v 100

Page

17

Figure 2 Architecture of the Second SC2 Pilot

Figure 2 Architecture of the Second SC2 Pilot

33 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing publication full-text and ingested datasets

A graph database for storing publication metadata (terms and named entities)

affiliation metadata (connections between researchers) weather metadata and VITIS

metadata

Processing infrastructures

Metadata extraction Spark or UnifiedViews3 are used to extract RDF metadata from

publication full-text These tools will react on Kafka messages Spark and UnifiedViews

will be evaluated for this task

3 Cf httpwwwunifiedviewseu

D54 ndash v 100

Page

18

PoolParty A SKOS Thesaurus4 will be used to consolidatetranslate (linkmap) the

terms in the ingested documents (eg bio terms locations and other named entities)

For this step the SWC PoolParty Semantic Suite5 will be used Additional enrichment

of the dataset will be explored eg via linking to DBpedia or other LOD sources

AKSTEM the process of discovering relations and associations between organizations

and people in the field of viticulture research

Phenolic Modeling algorithm already developed in AK VITIS will be adapted to work in

the context of an Apache Spark application

Variety Identification already developed in AK VITIS will be adapted to work in the

context of an Apache Spark application

Extraction of images and figures and their captions from publication PDFs

Data analysis which writes analysis results back into the infrastructure to be retrieved

for visualization Data analysis should accompany each write-back with appropriate

metadata that specify the processing lineage of the derived dataset Intermediate

results should also be written out (and described as such in the metadata) in order to

allow resuming processing after a failure

Other modules

Flume for publication ingestion For every source that will be ingested into the system

there will be a flume agent responsible for data ingestion and basic

modificationunification

Kafka as soon as a new record is available a Kafka message will be produced One

kafka consumer stores raw data into HDFS

34 Deployment

Table 4 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 4 Components needed to deploy the Second SC2 Pilot

Module Task Responsible

Spark over HDFS Flume

Kafka

BDI dockers made available by WP4 FH TF InfAI

SWC

4 Cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 5 Cf httpwwwpoolpartybiz

D54 ndash v 100

Page

19

GraphDB andor Neo4j

dockerization

To be investigated if the Docker

images provided by the official

systems6 are suitable for the pilot If

not will be altered for the pilot or use

an already dockerized triple store such

as Virtuoso or 4store

SWC

Flume agents for publication

ingestion and processing

To be developed for the pilot SWC

Flume agents for data

ingestion

To be extended for the pilot in order to

support the introduced datasets

(accuweather data user-generated

data)

SWC AK

Data storage schema To be developed for the pilot SWC AK

Phenolic modelling To be adapted from AK VITIS for the

pilot

AK

Spark AKSTEM To be adapted from AK STEM for the

pilot

AK

Variety Identification To be adapted from AK VITIS for the

pilot

AK

Table 4 Components needed to deploy the Second SC2 Pilot

6 httpsneo4jcomdeveloperdocker

D54 ndash v 100

Page

20

4 Second SC3 Pilot Deployment

41 Overview

The pilot is carried out by CRES in the frame of SC3 Secure Clean and Efficient Energy

The second pilot cycle extends the first pilot by adding additional online and offline data

analysis on raw data regarding Acoustic Emissions (AE) sensors and aggregated data such

as parametrics from continuous monitoring systems (CMS) The pilot demonstrates the

following workflow a developer in the field of wind energy enhances condition monitoring for

each unit in a wind farm by pooling together data from multiple units from the same farm (to

consider the cluster operation in total) and third party data (to perform correlated assessment)

The custom analysis modules created by the developer use both raw data that are transferred

offline to the processing cluster and condensed data streamed online at the same time order

that the event occurs

The following datasets are involved

Raw sensor and SCADA data from a given wind farm

Online stream data comprised of parametrics and statistics extracted from the raw

SCADA data

Raw sensor data from Acoustic Emissions module from a given wind farm

All data is in custom binary or ASCII formats ASCII files contain a metadata header and in

tabulated form the signal data (signal in columns time sequence in rows) All data is annotated

by location time and system id

The following processing is carried out

Near-real time execution of parametrized models to return operational statistics

warnings including correlation analysis of data across units

Weekly execution of operational statistics

Weekly execution of model parametrization

Weekly specific acoustic emissions DSP

The following outputs are made available for visualization or further processing

Operational statistics near-real time and weekly

Model parameters

D54 ndash v 100

Page

21

42 Requirements

Table 5 lists the ingestion storage processing and output requirements set by this pilot Since

the second cycle of the pilot extends the first pilot some requirements are identical and

therefore omitted from Table 5

Table 5 Requirements of Second SC3 Pilot

Requirement Comment

R1 The online data will be sent (via

OPC) from the intermediate

(local) processing level to BDI

A data connector must be developed that provides

for receiving OPC streams from an OPC-

compatible server

R2 The application should be able

to recover from short outages by

collecting the data transmitted

during the outage from the data

sources

An OPC data connector must be developed that

can retrieve the missing data collected at the

intermediate level from the distributed data

historian systems

R3 Near-realtime execution of

parametrized models to return

operational statistics including

correlation analysis of data

across units

The analysis software should write its results back

into a specified format and data model that is

appropriate input for further analysis

R4 The GUI supports database

querying and data visualization

for the analytics results

The GUI will be able to access files in the format

and data model

Table 5 Requirements of the Second SC3 Pilot

D54 ndash v 100

Page

22

Figure 3 Architecture of the Second SC3 Pilot

Figure 3 Architecture of the Second SC3 Pilot

43 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS that stores binary blobs each holding a temporal slice of the complete data The

slicing parameters are fixed and can be applied at data ingestion time

A Postgres relational database to store the warnings operational statistics and the

output of the analysis The schema will be defined in a later

A Kafka broker that will distribute the continuous stream of CMS to model execution

Processing infrastructures

D54 ndash v 100

Page

23

A processor that operates upon temporal slices of data

A Spark module that orchestrates the application of the processor on slices

A Spark streaming module that operates on the online data

Other modules

A data connector that offers an ingestion endpoint andor can retrieve from remote data

sources using the FTP protocol

A data connector that offers an ingestion endpoint that can retrieve an online stream

using OPC protocol and publish it to a Kafka topic

Data visualization that can visualize the data files stored in HDFS

44 Deployment

Table 6 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 6 Components needed to deploy the Second SC3 Pilot

Module Task Responsible

Spark HDFS Postgres

Kafka

BDI dockers made available by WP4 FH TF InfAI

NCSR-D SWC

Acoustic Emissions DSP To be developed for the pilot CRES

OPC Data connector To be developed for the pilot CRES

Data visualization To be extended for the pilot CRES

Table 6 Components needed to deploy the Second SC3 Pilot

D54 ndash v 100

Page

24

5 Second SC4 Pilot Deployment

51 Use cases

The pilot is carried out by FhG and CERTH in the frame of SC4 Smart Green and Integrated

Transport

The pilot demonstrates how to implement the workflow for ingesting processing and storing

stream and historical traffic data in a distributed environment The pilot demonstrates the

following workflows

The map matching of the Floating Car Data (FCD) stream that is generated by the taxi

fleet The FCD data that represents the position of cabs using latitude and longitude

coordinates must be map matched to the roads on which the cabs are driving in order

to infer the traffic conditions of the roads The map matching is done through an

algorithm using a geographical database and topological rules

The monitoring of the current traffic conditions that consumes the mapped FCD data

and infers the traffic conditions of the roads

The forecasting of future traffic conditions based on a model that is trained from

historical and real-time mapped FCD data

The second pilot is based upon the processing modules developed in the first pilot (cf D52

Section 5) namely the processing modules developed by CERTH to analyze traffic data

classify traffic conditions The second pilot will also develop the newly added workflow of the

traffic forecasting and model training that did not exist during the first pilot cycle

The data sources available for the pilot are:

• A near-real-time stream of Floating Car Data (FCD) generated by a fleet of 1200 taxis, containing information about the position, speed and direction of the cabs.
• A historical database of recorded FCD data.
• A geographical database with information about the road network in Thessaloniki.

The results of traffic monitoring and traffic forecasting are saved into a database for querying, statistics and visualizations.

5.2 Requirements

Table 7 lists the ingestion, storage, processing and output requirements set by this pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 7 lists only the new requirements.


Table 7: Requirements of the Second SC4 Pilot

Requirement | Comment
R1: The pilot will enable the evaluation of the present and future traffic conditions (e.g. congestion) within temporal windows. | The map-matched FCD data are used to determine the current traffic condition and to make predictions within different time windows.
R2: The traffic predictions will be saved in a database. | Traffic condition and prediction will be used for queries, statistics, evaluation of the quality of predictions, and visualizations.
R3: The pilot can be started in two configurations: single node (for development and testing) and cluster (production). | It must be possible to run all the pilot components in one single node for development and testing purposes. The cluster configuration must provide clustering of the components: messaging system (Kafka), processing modules (Flink, Spark, TensorFlow) and storage (Postgres).


Figure 4: Architecture of the Second SC4 Pilot

5.3 Architecture

The architecture of the pilot has been designed taking into consideration the data sources (mostly streams), the processing steps needed, and the information that needs to be computed. The pilot will ingest data from a near real-time FCD data stream from cabs and from historical FCD data. The FCD data needs to be preprocessed for map matching before being used for classification/prediction.

Apache Kafka will be used to distribute the computations, as it provides a scalable, fault-tolerant messaging system. The processing of the data streams will be performed within temporal windows. Apache Flink will be used for the map matching algorithm in the same manner as in the first cycle of the pilot. Apache Spark or TensorFlow will be considered as a platform to implement the traffic forecasting algorithm.

The algorithms used for the map matching and classification will be provided using R, as it provides good support for machine learning algorithms and because it is commonly used and well known by researchers at CERTH. In order to use the R packages in a Flink application developed in Java, the pilot will connect to an R server (via Rserve). Recurrent Neural Networks will be used for the traffic forecasting module.
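A minimal sketch of how a Recurrent Neural Network for traffic forecasting might look is given below, assuming the Keras API bundled with TensorFlow and a training set of fixed-length windows of past traffic-state features. The window length, feature count, layer sizes and placeholder data are illustrative assumptions, not pilot parameters.

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense

    WINDOW = 12      # assumed number of past time steps fed to the model
    FEATURES = 4     # assumed features per step (e.g. mean speed, vehicle count, ...)

    # Placeholder training data; in the pilot these would come from historical map-matched FCD.
    X_train = np.random.rand(1000, WINDOW, FEATURES)
    y_train = np.random.rand(1000, 1)            # traffic indicator to predict

    model = Sequential([
        LSTM(32, input_shape=(WINDOW, FEATURES)),  # recurrent layer over the temporal window
        Dense(1),                                   # predicted traffic indicator
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X_train, y_train, epochs=5, batch_size=32)

    # Forecast for a new window of observations.
    prediction = model.predict(X_train[:1])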

The computed traffic conditions and predictions will be stored in a scalable, fault-tolerant database such as Elasticsearch. The storage system must support spatial and temporal indexing.
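The last step of this chain can be pictured as in the sketch below, under the assumption that prediction results arrive as JSON messages on a Kafka topic and are indexed into Elasticsearch together with their spatial and temporal attributes. The topic and index names are hypothetical.

    import json
    from kafka import KafkaConsumer
    from elasticsearch import Elasticsearch

    consumer = KafkaConsumer(
        "traffic-predictions",                     # assumed topic name
        bootstrap_servers="kafka:9092",
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    )
    es = Elasticsearch(["http://elasticsearch:9200"])

    for message in consumer:
        doc = message.value
        # Each document is expected to carry a road segment id, a timestamp and a geo point,
        # so that Elasticsearch can serve the spatial and temporal queries of the pilot.
        es.index(index="traffic-predictions", body=doc)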

5.4 Deployment

Table 8 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.

Table 8: Components needed to deploy the Second SC4 Pilot

Module | Task | Responsible
PostGIS, Elasticsearch, Kafka, Flink, Spark, TensorFlow | BDI dockers made available by WP4 | NCSR-D, SWC, TF, FhG
A Kafka producer for the FCD data stream (source: URL) and historical data (source: file system) | Develop a Kafka producer to collect the FCD data as a stream from web services and from the file system for the historical data sets, and send them to a Kafka topic | FhG
Kafka brokers | Install Kafka to provide a message broker and the topics | SWC
A Spark application for traffic forecasting and model training | Develop a Spark application that consumes map-matched FCD data from a Kafka topic. The application will train a prediction model and write the traffic predictions to Elasticsearch | FhG
A Kafka consumer for storing analysis results | Develop a Kafka consumer that stores the result of the Traffic Classification and Prediction module | FhG

6 Second SC5 Pilot Deployment

6.1 Use cases

The pilot is carried out by NCSR-D in the frame of SC5: Climate Action, Environment, Resource Efficiency and Raw Materials.

The pilot demonstrates the following workflow. A (potentially hazardous) substance is released in the atmosphere, resulting in increased readings in one or more monitoring stations. The user accesses a user interface provided by the pilot to define the locations of the monitoring stations as well as a time series of the measured values (e.g. gamma dose rate). The platform initiates:

• a weather matching algorithm, that is, a search for similarity between the current weather and the pre-computed weather patterns, as well as
• a dispersion matching algorithm, that is, a search for similarity of the current substance dispersion patterns with the precomputed ones.

The weather patterns have been extracted in a pre-processing step by clustering weather conditions recorded in the past, while the substance dispersion patterns have been precomputed by simulating different scenarios of substance release and weather conditions. The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request.

The following datasets are involved:

• NetCDF files from the European Centre for Medium-range Weather Forecasting (ECMWF)7
• GRIB files from the National Oceanic and Atmospheric Administration (NOAA)8

7 http://apps.ecmwf.int/datasets/
8 https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/global-forcast-system-gfs

The following processing will be carried out:

• The weather clustering algorithm that creates clusters of similar weather conditions, implemented using the BDI platform (see Section 6.3).


• The WRF downscaling that takes as input low-resolution weather data and creates high-resolution weather data.
• The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model that computes dispersion patterns given predominant weather conditions.

The following outputs are made available for visualization or further processing:

• The dispersions produced by DIPCOT.
• The weather clusters produced by the weather clustering algorithm.

6.2 Requirements

Table 9 lists the ingestion, storage, processing and output requirements set by this pilot.

Table 9: Requirements of the Second SC5 Pilot

Requirement | Comment
R1: Provide a means of downloading current/evaluation weather from ECMWF or alternative services. | A data connector/interface needs to be developed.
R2: ECMWF and NOAA datasets are compatible with the WRF and DIPCOT naming conventions. | A preprocessing WPS normalization step will perform the necessary transformations and variable renamings needed to ensure compatibility.
R3: Retrieve NetCDF files from HDFS as input to the weather clustering algorithm. |
R4: Dispersion matching will filter on dispersion values. | A relational database will provide indexes on dispersion values for efficient dispersion search (see the sketch after this table).
R5: Dispersion visualization. | Weather and dispersion matching must produce output compatible with Sextant's input, or Sextant must be modified to support the new input.
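For R4, a dispersion-matching query could filter the pre-computed dispersions on their stored values through an indexed relational table. The sketch below assumes a hypothetical dispersions table and uses the psycopg2 client, purely as an illustration of the access pattern; the actual schema will be defined for the pilot.

    import psycopg2

    # Connection parameters are deployment-specific assumptions.
    conn = psycopg2.connect(host="postgres", dbname="sc5", user="pilot", password="secret")

    def matching_dispersions(station_id, value, tolerance):
        """Return scenario ids whose pre-computed dispersion at a station is close to the observed value."""
        with conn.cursor() as cur:
            # Assumes an index on (station_id, value) so the range filter is efficient.
            cur.execute(
                """
                SELECT scenario_id
                FROM dispersions
                WHERE station_id = %s
                  AND value BETWEEN %s AND %s
                """,
                (station_id, value - tolerance, value + tolerance),
            )
            return [row[0] for row in cur.fetchall()]

    candidates = matching_dispersions("station-12", 0.8, 0.1)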


Figure 5: Architecture of the Second SC5 Pilot

6.3 Architecture

To satisfy the requirements described above, the following components will be deployed.

Storage infrastructure:

• HDFS for storing NetCDF and GRIB files
• Postgres for storing dispersions

Processing components:

• Scikit-learn or TensorFlow to host the weather clustering algorithm (see the sketch below)

Other modules:

• ECMWF and NOAA data connectors
• WPS normalization procedure
• WRF downscaling component
• DIPCOT atmospheric dispersion model
• Weather and dispersion matching
• Sextant for visualizing the dispersion layer
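As an indication of how the weather clustering could be hosted on Scikit-learn (one of the two options above), the sketch below clusters flattened weather fields with k-means. The number of clusters, the shape of the feature vectors and the placeholder data are assumptions for illustration only.

    import numpy as np
    from sklearn.cluster import KMeans

    # Placeholder: each row is one historical weather snapshot flattened into a feature vector
    # (e.g. wind components and temperature over the model grid, read from the NetCDF files in HDFS).
    weather_vectors = np.random.rand(500, 1024)

    kmeans = KMeans(n_clusters=8, random_state=0).fit(weather_vectors)

    # Cluster centres act as the pre-computed weather patterns;
    # weather matching then assigns the current weather to its nearest pattern.
    current_weather = np.random.rand(1, 1024)
    pattern_id = int(kmeans.predict(current_weather)[0])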


6.4 Deployment

Table 10 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.

Table 10: Components needed to deploy the Second SC5 Pilot

Module | Task | Responsible
HDFS, Sextant, Postgres | BDI dockers made available by WP4 | TF, UoA, NCSR-D
Scikit-learn, TensorFlow | To be developed in the pilot | NCSR-D
DIPCOT | To be packaged in the pilot | NCSR-D
Weather clustering algorithm | To be developed in the pilot | NCSR-D
Weather matching | To be developed in the pilot | NCSR-D
Dispersion matching | To be developed in the pilot | NCSR-D
ECMWF and NOAA data connector | To be developed in the pilot | NCSR-D
Data visualization UI | To be developed in the pilot | NCSR-D


7 Second SC6 Pilot Deployment

7.1 Use cases

The pilot is carried out by NCSR-D and SWC in the frame of SC6: Europe in a changing world - inclusive, innovative and reflective societies.

The pilot demonstrates the following workflow. Municipality economic data (i.e. budget and budget execution data) are ingested on a regular basis (daily, weekly, and so on) from a series of locations, in a variety of structures and formats, and are homogenized so that they can be compared, analyzed and visualized in a comprehensible way. The data is exposed to users via a dashboard that exposes search/discovery, aggregation, analysis, correlation and visualization functionalities over structured data. The results of the data analysis will be stored in the infrastructure to avoid carrying out the same processing multiple times.

The second cycle of the pilot will extend the first pilot by incorporating different formats, developing a modular parsing library for this purpose.

The following datasets are involved:

• Budget execution data of the Municipality of Athens
• Budget execution data of the Municipality of Thessaloniki
• Budget execution data of the Municipality of Barcelona

The current datasets involved are exposed either as an API or as CSV / XML files. Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies. Statistical data will be described in the RDF Data Cube12 vocabulary.
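To make the target representation concrete, the sketch below builds a single RDF Data Cube observation for one budget line with rdflib. The namespace, dimension and measure properties are simplified assumptions for the example, not the pilot's actual data model.

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF, XSD

    QB = Namespace("http://purl.org/linked-data/cube#")
    EX = Namespace("http://example.org/budget/")        # hypothetical namespace for the example

    g = Graph()
    obs = URIRef(EX["athens/2016/expenditure/obs-001"])

    g.add((obs, RDF.type, QB.Observation))
    g.add((obs, QB.dataSet, EX["athens-budget-execution-2016"]))
    g.add((obs, EX.municipality, EX["Athens"]))          # dimension: reporting municipality
    g.add((obs, EX.period, Literal("2016-Q1")))          # dimension: reporting period
    g.add((obs, EX.amount, Literal("1250000.00", datatype=XSD.decimal)))  # measure

    print(g.serialize(format="turtle"))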

The following processing is carried out:

• Data ingestion and homogenization
• Aggregation, analysis, correlation over scientific data

The following outputs are made available for visualization or further processing:

• Structured information extracted from budget datasets, exposed as a SPARQL endpoint
• Metadata for dataset searching and discovery

9 Cf. https://joinup.ec.europa.eu/asset/dcat_application_profile/description
10 Cf. http://www.omg.org/spec/EDMC-FIBO/FND/1.0/Beta1/index.htm
11 Cf. http://www.omg.org/hot-topics/finance.htm
12 Cf. https://www.w3.org/TR/2014/REC-vocab-data-cube-20140116/


• Aggregation and analysis results

7.2 Requirements

Table 11 lists the ingestion, storage, processing and output requirements set by this pilot.

Table 11: Requirements of the Second SC6 Pilot

Requirement | Comment
R1: In case of failures during data processing, users should be able to recover data produced earlier in the workflow, along with their lineage and associated metadata. | Processing modules should periodically store intermediate results. When starting up, processing modules should check the metadata registry to see whether intermediate results are available (see the sketch after this table).
R2: Transform budget data into a homogenized format using various parsers. | Parsers will be developed for the pilot, taking into account R1.
R3: Expose data and metadata through a SPARQL endpoint. | The triple store should be accessed via a SPARQL endpoint.
R4: Intuitive, easy-to-use interface for searching and selecting relevant data sources. The use of the user interface should be documented so that users can ease into using it with as little effort as possible. | The GraphSearch UI will be used to create visualizations from SPARQL queries.
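The recovery behaviour required by R1 can be summarised by the check-then-resume pattern sketched below. The registry is abstracted here as a simple key/value store and the file paths are hypothetical, since the actual lineage metadata schema will be defined during the pilot.

    import json
    from pathlib import Path

    REGISTRY = Path("registry.json")   # stand-in for the pilot's metadata registry

    def load_registry():
        return json.loads(REGISTRY.read_text()) if REGISTRY.exists() else {}

    def register_intermediate(step, path, lineage):
        """Record that an intermediate result exists, together with its lineage metadata."""
        registry = load_registry()
        registry[step] = {"path": str(path), "lineage": lineage}
        REGISTRY.write_text(json.dumps(registry))

    def resume_or_compute(step, compute):
        """On start-up, reuse a stored intermediate result if one is registered; otherwise recompute."""
        registry = load_registry()
        if step in registry:
            return json.loads(Path(registry[step]["path"]).read_text())
        result = compute()
        path = Path(f"{step}.json")
        path.write_text(json.dumps(result))
        register_intermediate(step, path, {"produced_by": step})
        return result

    homogenized = resume_or_compute("homogenize-budgets", lambda: {"rows": 1024})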


Figure 6: Architecture of the Second SC6 Pilot

7.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures:

• HDFS for storing ingested datasets
• 4store for storing homogenized statistical data and dataset metadata

Processing infrastructures:

• Metadata extraction: Spark is used to extract RDF data and metadata from budget data. These tools will react on Kafka messages.
• PoolParty: A SKOS Thesaurus13 will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g. bio terms, locations and other named entities). For this step the SWC PoolParty Semantic Suite14 will be used as an external service. PoolParty is accessible from the BDE components via an HTTP API. The connection between Spark and PoolParty has been implemented in the first pilot cycle. Additional enrichment of the dataset will be explored, e.g. via linking to DBpedia or other LOD sources.
• Data analysis that will be performed on demand by pre-defined queries in the dashboard.

13 Please cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System
14 Please cf. http://www.poolparty.biz

Other modules:

• Flume for dataset ingestion. For every source that will be ingested into the system there will be a Flume agent responsible for data ingestion and basic modification/unification.
• Kafka: as soon as a new record is available, a Kafka message will be produced. One Kafka consumer stores raw data into HDFS.
• A set of pre-defined SPARQL queries that carry out analytical aggregations, important comparisons and/or other analysis of the data (see the sketch below).
• A GUI that provides functionality for (a) metadata searching to discover datasets, data and publications, and (b) linked data browsing (i.e. dereferencing entity descriptions in RDF), in the form of a visual dashboard realised in d3.js15.
• GraphSearch as the user interface.
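One of the pre-defined SPARQL queries mentioned above could be executed against the 4store endpoint along the lines of the following sketch. The endpoint URL, graph vocabulary and property names are illustrative assumptions rather than the pilot's actual configuration.

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Assumed endpoint of the 4store instance deployed for the pilot.
    sparql = SPARQLWrapper("http://4store:8080/sparql/")
    sparql.setReturnFormat(JSON)

    # Pre-defined aggregation: total budgeted amount per municipality and period
    # (property names are placeholders for the pilot's actual data model).
    sparql.setQuery("""
        PREFIX qb: <http://purl.org/linked-data/cube#>
        PREFIX ex: <http://example.org/budget/>
        SELECT ?municipality ?period (SUM(?amount) AS ?total)
        WHERE {
            ?obs a qb:Observation ;
                 ex:municipality ?municipality ;
                 ex:period ?period ;
                 ex:amount ?amount .
        }
        GROUP BY ?municipality ?period
    """)

    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["municipality"]["value"], row["period"]["value"], row["total"]["value"])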

7.4 Deployment

Table 12 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.

Table 12: Components needed to deploy the Second SC6 Pilot

Module | Task | Responsible
Spark over HDFS, 4store, Flume, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, SWC
Data storage schema | To be extended for the pilot | SWC
Metadata extraction | Parsers for different data sources will be developed for the pilot | SWC
GraphSearch GUI | To be configured for the pilot | SWC

15 Cf. https://d3js.org

8 Second SC7 Pilot Deployment

8.1 Use cases

The pilot is carried out by SatCen, UoA and NCSR-D in the frame of SC7: Secure societies – Protecting freedom and security of Europe and its citizens.

The pilot demonstrates the following workflows:

1. Event detection workflow: News sites and social media are monitored and processed in order to extract and localize information about events. Events are categorized and the relevant information is extracted; the end-user is notified about the area concerned by the news and can visualize the event information together with the changes detected by the other workflow (if activated).
2. Change detection workflow: The end user selects a relevant Area of Interest. With respect to the selected dates, two satellite images (earliest and latest) of these areas are downloaded from the ESA Sentinels Scientific Data Hub and processed in order to detect changes. The end-user is notified about detected changes and can view the images and event information about this area.

The second cycle of the SC7 pilot will extend the functionality and improve the performance of the first cycle of the pilot (cf. D5.2 Section 8).

Apart from the datasets used in the first cycle of the pilot, this cycle will also use the keyword-based Twitter API to retrieve tweets based on pre-defined keywords. To further support the keyword-based search, the second cycle of the pilot will also include a full-text indexing engine.

The following outputs are made available for visualization or further processing:

• Relevant news related to specific keywords, together with the corresponding Area of Interest
• Detected changes

Moreover, the event detection workflow will be extended in order to automatically activate the change detection workflow. These changes are depicted in the updated architecture diagram in Figure 7.


8.2 Requirements

Table 13 lists the ingestion, storage, processing and output requirements set by the second cycle of the pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 13 lists only the new requirements.

Table 13: Requirements of the Second SC7 Pilot

Requirement | Comment
R1: Monitor keyword-based text services (Twitter). Text is retrieved and stored together with provenance and any metadata provided by the service (notably location). | The NOMAD data connectors to Twitter and Reuters will be adapted to access the keyword search API of Twitter and store to Cassandra.
R2: Regularly execute event detection using Spark over the most recent text batch. | Event detection is part of the ingestion process and adds annotations to the text data, not part of the distributed processing.
R3: Improve the speed of the change detection workflow. | Optimize the scalability of the operators developed in Apache Spark for the change detection workflow (see the sketch after this table).
R4: Extend the change detection workflow to improve accuracy. | Fundamental SNAP operators (e.g. Subset and Terrain Correction) for Sentinel 1 will be adapted to Apache Spark.
R5: Areas of Interest are automatically defined by event detection. | The Sentinel data connector is parametrized from the event detection module with a GIS shape.
R6: The end-user interface is based on Sextant. | Improvement of Sextant functionalities to improve the user experience.
R7: Users must be authenticated and authorized to access the pilot data. | Sextant will be extended in order to support authentication and authorization.
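Regarding R3 and R4, the scaling of the change detection operators over Spark can be pictured as distributing pairs of co-registered image tiles across the cluster, as in the sketch below. The HDFS paths and the detect_change function are hypothetical stand-ins for the SNAP-based operators of the pilot.

    from pyspark import SparkContext

    sc = SparkContext(appName="sc7-change-detection")

    # Hypothetical list of (earlier tile, later tile) HDFS paths for one Area of Interest.
    tile_pairs = [
        ("hdfs:///sc7/aoi1/2016-10/tile_%03d.tif" % i,
         "hdfs:///sc7/aoi1/2017-01/tile_%03d.tif" % i)
        for i in range(64)
    ]

    def detect_change(pair):
        """Placeholder for the per-tile pipeline (subset, terrain correction, differencing)."""
        earlier, later = pair
        # ... load both tiles, apply the SNAP-derived operators, return change polygons ...
        return {"earlier": earlier, "later": later, "changes": []}

    # Each tile pair is processed independently, so the work parallelises across executors.
    results = sc.parallelize(tile_pairs, numSlices=16).map(detect_change).collect()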

Figure 7: Architecture of the Second SC7 Pilot

8.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures:

• HDFS for storing satellite images
• Cassandra for storing news and tweets, content and metadata
• Lucene for storing the GADM dataset, i.e. the administrative areas together with their geo-locations
• Strabon for storing geo-locations of detected changes and location metadata about news and tweets

Processing infrastructures:

• Spark will be made available for improving the change detection module and developing the event detection module.

Data integration:

• Semagrow will federate Strabon and Cassandra to provide the user interface with homogeneous access to both data stores.

Other modules:

• Twitter data connector (see the sketch below)
• Reuters RSS feed reader
• The Sentinel Data Aggregator, which receives as input the set of areas of interest and submits a suitable query to the Sentinels Scientific Data Hub
• Sextant as the user interface
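The Twitter data connector stores each retrieved tweet in Cassandra together with its provenance and location metadata (R1). A minimal sketch of the storage side is given below, assuming a hypothetical tweets table in a hypothetical sc7 keyspace and the DataStax Python driver; the retrieval side (keyword search API) is omitted.

    from datetime import datetime
    from cassandra.cluster import Cluster

    cluster = Cluster(["cassandra"])          # assumed contact point of the pilot deployment
    session = cluster.connect("sc7")          # assumed keyspace

    # Hypothetical table: tweets(id, keyword, text, created_at, latitude, longitude, source)
    insert = session.prepare(
        "INSERT INTO tweets (id, keyword, text, created_at, latitude, longitude, source) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)"
    )

    def store_tweet(tweet, keyword):
        """Persist one tweet returned by the keyword search API, keeping provenance metadata."""
        session.execute(insert, (
            tweet["id"],
            keyword,
            tweet["text"],
            datetime.utcnow(),
            tweet.get("lat"),
            tweet.get("lon"),
            "twitter-keyword-search",
        ))

    store_tweet({"id": 1234567890, "text": "flooding reported near the port",
                 "lat": 40.63, "lon": 22.94}, "flood")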

8.4 Deployment

Table 14 lists the components provided to the pilot as part of BDI16 and components that will be developed within WP6 in the context of executing the pilot.

Table 14: Components needed to deploy the Second SC7 Pilot

Module | Task | Responsible
Big Data Integrator: HDFS/Hadoop, Cassandra, Spark, Semagrow, Strabon, SOLR | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, UoA, SWC
Cassandra and Strabon stores | The schema needs to be altered to support tweets by keyword | NCSR-D and UoA
Change detection module | Spark code to be developed for extending and improving the change detection algorithm | UoA
Event Detection module | Spark code to be developed to scale the event detection algorithm | NCSR-D
Twitter data connector | To be extended to access the keyword search Twitter API | NCSR-D
User interface | To be enhanced for the pilot | UoA

16 Cf. https://github.com/big-data-europe/README/wiki/Components


9 Conclusions

This report analysed the pilot requirements and specified the components of the generic Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting round. The relevant work in this task is to ensure that the components are within the scope of what is prepared in WP4, and that they interoperate and can be used in the same application.

All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure and provided to the piloting partners as a basis for their piloting applications, which will be developed in WP6. As a result of this preliminary testing and the interaction between the technical partners and the piloting partners, some of the original pilot descriptions have been refined and fully specified, and their usage of BDI components has been clarified. This ensures that the pilot descriptions are consistent with the first public release of the BDI platform (D4.2) and can be reproduced by interested third parties.

Work in this task (Task 5.2) will proceed as follows:

• During the second pilot deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.
• During the third pilot deployment phase, work in this task will prepare the next version of this document regarding the BDI instances needed for the third piloting round.



Figure 3 Architecture of the Second SC3 Pilot

Figure 3 Architecture of the Second SC3 Pilot

43 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS that stores binary blobs each holding a temporal slice of the complete data The

slicing parameters are fixed and can be applied at data ingestion time

A Postgres relational database to store the warnings operational statistics and the

output of the analysis The schema will be defined in a later

A Kafka broker that will distribute the continuous stream of CMS to model execution

Processing infrastructures

D54 ndash v 100

Page

23

A processor that operates upon temporal slices of data

A Spark module that orchestrates the application of the processor on slices

A Spark streaming module that operates on the online data

Other modules

A data connector that offers an ingestion endpoint andor can retrieve from remote data

sources using the FTP protocol

A data connector that offers an ingestion endpoint that can retrieve an online stream

using OPC protocol and publish it to a Kafka topic

Data visualization that can visualize the data files stored in HDFS

44 Deployment

Table 6 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 6 Components needed to deploy the Second SC3 Pilot

Module Task Responsible

Spark HDFS Postgres

Kafka

BDI dockers made available by WP4 FH TF InfAI

NCSR-D SWC

Acoustic Emissions DSP To be developed for the pilot CRES

OPC Data connector To be developed for the pilot CRES

Data visualization To be extended for the pilot CRES

Table 6 Components needed to deploy the Second SC3 Pilot

D54 ndash v 100

Page

24

5 Second SC4 Pilot Deployment

51 Use cases

The pilot is carried out by FhG and CERTH in the frame of SC4 Smart Green and Integrated

Transport

The pilot demonstrates how to implement the workflow for ingesting processing and storing

stream and historical traffic data in a distributed environment The pilot demonstrates the

following workflows

The map matching of the Floating Car Data (FCD) stream that is generated by the taxi

fleet The FCD data that represents the position of cabs using latitude and longitude

coordinates must be map matched to the roads on which the cabs are driving in order

to infer the traffic conditions of the roads The map matching is done through an

algorithm using a geographical database and topological rules

The monitoring of the current traffic conditions that consumes the mapped FCD data

and infers the traffic conditions of the roads

The forecasting of future traffic conditions based on a model that is trained from

historical and real-time mapped FCD data

The second pilot is based upon the processing modules developed in the first pilot (cf D52

Section 5) namely the processing modules developed by CERTH to analyze traffic data

classify traffic conditions The second pilot will also develop the newly added workflow of the

traffic forecasting and model training that did not exist during the first pilot cycle

The data sources available for the pilot are

A near-real time stream Floating Car Data (FCD) generated by a fleet of 1200 taxis

containing information about the position speed and direction of the cabs

A historical database of recorded FCD data

A geographical database with information about the road network in Thessaloniki

The results of traffic monitoring and traffic forecasting are saved into a database for querying

statistics and visualizations

52 Requirements

Table 7 lists the ingestion storage processing and output requirements set by this pilot Since

the present pilot cycle is an extension of the first pilot the requirements of the first pilot also

apply Table 13 lists only the new requirements

D54 ndash v 100

Page

25

Table 7 Requirements of the Second SC4 Pilot

Requirement Comment

R1 The pilot will enable the

evaluation of the present and

future traffic conditions (eg

congestion) within temporal

windows

The FCD map matched data are used to determine

the current traffic condition and to make predictions

within different time windows

R2 The traffic predictions will be

saved in a database

Traffic condition and prediction will be used for

queries statistics evaluation of the quality of

predictions visualizations

R3 The pilot can be started in two

configurations single node (for

development and testing) and

cluster (production)

It must be possible to run all the pilot components

in one single node for development and testing

purposes The cluster configuration must provide

cluster of any components messaging system

(Kafka) processing modules (Flink Spark

TensorFlow) storage (Postgres)

Table 7 Requirements of the Second SC4 Pilot

D54 ndash v 100

Page

26

Figure 4 Architecture of the Second SC4 Pilot

Figure 4 Architecture of the Second SC4 Pilot

53 Architecture

The architecture of the pilot has been designed taking into consideration the data sources

mostly streams the processing steps needed and the information that needs to be computed

The pilot will ingest data from a near real-time FCD data stream from cabs and from historical

FCD data The FCD data needs to be preprocessed for map matching before being used for

classificationprediction

Apache Kafka will be used to distribute the computations as it provides a scalable fault

tolerant messaging system The processing of the data streams will be performed within

temporal windows Apache Flink will be used for the map matching algorithm in the same

manner as in the first cycle of the pilot Apache Spark or Tensorflow will be considered as a

platform to implement the traffic forecasting algorithm

The algorithms used for the map matching and classification will be provided using R as

it provides a good support for machine learning algorithms and because it is commonly used

D54 ndash v 100

Page

27

and well known by researchers at CERTH In order to use the R packages in a Flink application

developed in Java the pilot will connect to R server (via Rserve) Recurrent Neural Networks

will be used for the traffic forecasting module

The traffic conditions and prediction computation will be stored in a scalable fault tolerant

database such as Elasticsearch The storage system must support spatial and temporal

indexing

54 Deployment

Table 8 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 8 Components needed to deploy Second SC4 Pilot

Module Task Responsible

PostGIS Elasticsearch

Kafka Flink Spark

TensorFlow

BDI dockers made available by WP4 NCSR-D SWC

TF FhG

A Kafka producer for FCD

data stream (source URL)

and historical data (source

file system)

Develop a Kafka producer to collect

the FCD data as a stream from web

services and from the file system for

the historical data sets and send them

to a Kafka topic

FhG

Kafka brokers Install Kafka to provide a message

broker and the topics

SWC

A Spark application for traffic

forecasting and model

training

Develop a Spark application that

consumes FCD matched data from a

Kafka topic The application will train a

prediction model and write the traffic

predictions to ElasticSearch

FhG

D54 ndash v 100

Page

28

A Kafka consumer for storing

analysis results

Develop a Kafka consumer that stores

the result of the Traffic Classification

and prediction module

FhG

Table 8 Components needed to deploy the Second SC4 Pilot

6 Second SC5 Pilot Deployment

61 Use cases

The pilot is carried out by NCSR-D in the frame of SC5 Climate Action Environment Resource

Efficiency and Raw Materials

The pilot demonstrates the following workflow A (potentially hazardous) substance is released

in the atmosphere that results to increased readings in one or more monitoring stations The

user accesses a user interface provided by the pilot to define the locations of the monitoring

stations as well as a timeseries of the measured values (eg gamma dose rate) The platform

initiates

a weather matching algorithm that is a search for similarity of the current weather and

the pre-computed weather patterns as well as

a dispersion matching algorithm that is a search for similarity of the current substance

dispersion patterns with the precomputed ones

The weather patterns have been extracted in a pre-processing step by clustering weather

conditions recorded in the past while the substance dispersion patterns have been

precomputed by simulating different scenarios of substance release and weather conditions

The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request

The following datasets are involved

NetCDF files from the European Centre for Medium range Weather Forecasting

(ECMWF7)

GRIB files from National Oceanic and Atmospheric Administration (NOAA8)

The following processing will be carried out

The weather clustering algorithm that creates clusters of similar weather conditions

implemented using the BDI platform (see Section 63)

7 httpappsecmwfintdatasets 8 httpswwwncdcnoaagovdata-accessmodel-datamodel-datasetsglobal-forcast-system-gfs

D54 ndash v 100

Page

29

The WRF downscaling that takes as input a low resolution weather and creates a high

resolution weather

The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model

computes dispersion patterns given predominant weather conditions

The following outputs are made available for visualization or further processing

The dispersions produced by DIPCOT

The weather clusters produced by the weather clustering algorithm

62 Requirements

Table 9 lists the ingestion storage processing and output requirements set by this pilot

Table 9 Requirements of Second SC5 Pilot

Requirement Comment

R1 Provide a means of downloading

currentevaluation weather from

ECMWF or alternative services

Data connectorinterface needs to be developed

R2 ECMWF and NOAA datasets are

compatible with the WRF and

DIPCOT naming conventions

A preprocessing WPS normalization step will

perform the necessary transformations and

variable renamings needs to ensure compatibility

R3 Retrieve NetCDF files from HDFS

as input to the weather clustering

algorithm

R4 Dispersion matching will filter on

dispersion values

Relational database will provide indexes on

dispersion values for efficient dispersion search

R5 Dispersion visualization Weather and dispersion matching must produce

output compatible with Sextantrsquos input or Sextant

must be modified to support new input

Table 9 Requirements of the Second SC5 Pilot

D54 ndash v 100

Page

30

Figure 5 Architecture of the Second SC5 Pilot

Figure 5 Architecture of the Second SC5 Pilot

63 Architecture

To satisfy the requirements described above the following components will be deployed

Storage infrastructure

HDFS for storing NetCDF and GRIB files

Postgres for storing dispersions

Processing components

Scilearn-kit or TensorFlow to host the weather clustering algorithm

Other modules

ECMWF and NOAA data connectors

WPS normalization procedure

WRF downscaling component

DIPCOT atmospheric dispersion model

Weather and dispersion matching

Sextant for visualizing the dispersion layer

D54 ndash v 100

Page

31

64 Deployment

Table 10 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 10 Components needed to deploy the Second SC5 Pilot

Module Task Responsible

HDFS Sextant Postgres BDI dockers made available by WP4 TF UoA NCSR-D

Scikit-learn TensorFlow To be developed in the pilot NCSR-D

DIPCOT To be packaged in the pilot NCSR-D

Weather clustering algorithm To be developed in the pilot NCSR-D

Weather matching To be developed in the pilot NCSR-D

Dispersion matching To be developed in the pilot NCSR-D

ECMWF and NOAA data

connector

To be developed in the pilot NCSR-D

Data visualization UI To be developed in the pilot NCSR-D

Table 10 Components needed to deploy the Second SC5 Pilot

D54 ndash v 100

Page

32

7 Second SC6 Pilot Deployment

71 Use cases

The pilot is carried out by NCSR-D and SWC in the frame of SC6 Europe in a changing world

- inclusive innovative and reflective societies

The pilot demonstrates the following workflow Municipality economic data (ie budget and

budget execution data) are ingested at a regular basis (daily weekly and so on) from a series

of locations in a variety of structures and formats are homogenized so that they can be

compared analyzed and visualized in a comprehensible way The data is exposed to users

via a dashboard that exposes searchdiscovery aggregation analysis correlation and

visualization functionalities over structured data The results of the data analysis will be stored

in the infrastructure to avoid carrying out the same processing multiple times

The second cycle of the pilot will extend the first pilot by incorporating different formats by

developing a modular parsing library

The following datasets are involved

Budget execution data of Municipality of Athens

Budget execution data of Municipality of Thessaloniki

Budget execution data of Municipality of Barcelona

The current datasets involved are exposed either as an API or as CSV XML files

Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies

Statistical data will be described in the RDF DataCube12 vocabulary

The following processing is carried out

Data ingestion and homogenization

Aggregation analysis correlation over scientific data

The following outputs are made available for visualization or further processing

Structured information extracted from budget datasets exposed as a SPARQL endpoint

Metadata for dataset searching and discovery

9 Cf httpsjoinupeceuropaeuassetdcat_application_profiledescription 10 Cf httpwwwomgorgspecEDMC-FIBOFND10Beta1indexhtm 11 Cf httpwwwomgorghot-topicsfinancehtm 12 Cf httpswwww3orgTR2014REC-vocab-data-cube-20140116

D54 ndash v 100

Page

33

Aggregation and analysis

72 Requirements

Table 11 lists the ingestion storage processing and output requirements set by this pilot

Table 11 Requirements of the Second SC6 Pilot

Requirement | Comment
R1: In case of failures during data processing, users should be able to recover data produced earlier in the workflow, along with their lineage and associated metadata. | Processing modules should periodically store intermediate results. When starting up, processing modules should check at the metadata registry whether intermediate results are available.
R2: Transform budget data into a homogenized format using various parsers. | Parsers will be developed for the pilot, taking into account R1.
R3: Expose data and metadata through a SPARQL endpoint. | The triple store should be accessed via a SPARQL endpoint.
R4: Intuitive, easy-to-use interface for searching and selecting relevant data sources. The use of the user interface should be documented so that users can ease into using it with as little effort as possible. | The GraphSearch UI will be used to create visualizations from SPARQL queries.

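To make R3 and R4 concrete, one of the pre-defined analytical queries could look roughly like the following, issued here against the 4store SPARQL endpoint from Python; the endpoint URL and the dimension/measure properties are assumptions made for illustration only.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://4store:8080/sparql/")   # assumed endpoint URL
sparql.setReturnFormat(JSON)
sparql.setQuery("""
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX ex: <http://example.org/budget/>

SELECT ?municipality ?year (SUM(?amount) AS ?total)
WHERE {
  ?obs a qb:Observation ;
       ex:municipality   ?municipality ;
       ex:fiscalYear     ?year ;
       ex:executedAmount ?amount .
}
GROUP BY ?municipality ?year
ORDER BY ?municipality ?year
""")

# Print the total executed amount per municipality and fiscal year.
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["municipality"]["value"], row["year"]["value"], row["total"]["value"])
```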


Figure 6 Architecture of the Second SC6 Pilot

7.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures:
- HDFS for storing ingested datasets
- 4store for storing homogenized statistical data and dataset metadata

Processing infrastructures:
- Metadata extraction: Spark is used to extract RDF data and metadata from budget data. These tools will react to Kafka messages.
- PoolParty: a SKOS Thesaurus13 will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g. bio terms, locations and other named entities). For this step the SWC PoolParty Semantic Suite14 will be used as an external service.

13 Please cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System
14 Please cf. http://www.poolparty.biz


PoolParty is accessible from the BDE components via an HTTP API. The connection between Spark and PoolParty has been implemented in the first pilot cycle. Additional enrichment of the dataset will be explored, e.g. via linking to DBpedia or other LOD sources.
- Data analysis, which will be performed on demand by pre-defined queries in the dashboard.

Other modules:
- Flume for dataset ingestion: for every source that will be ingested into the system there will be a Flume agent responsible for data ingestion and basic modification/unification.
- Kafka: as soon as a new record is available, a Kafka message will be produced. One Kafka consumer stores raw data into HDFS (a sketch of such a consumer follows after this list).
- A set of pre-defined SPARQL queries that carry out analytical aggregations, important comparisons and/or other analysis of the data.
- A GUI that provides functionality for (a) metadata searching to discover datasets, data and publications, and (b) linked data browsing (i.e. dereferencing entity descriptions in RDF) in the form of a visual dashboard realised in d3.js15.
- GraphSearch as the user interface.
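A minimal sketch of the Kafka consumer that archives raw ingested records to HDFS is given below; the topic name, the WebHDFS endpoint and the target path layout are assumptions, and the kafka-python and hdfs client libraries are used only for illustration.

```python
import json
from kafka import KafkaConsumer          # kafka-python
from hdfs import InsecureClient          # hdfs (WebHDFS) client

consumer = KafkaConsumer(
    "budget-raw",                        # assumed topic carrying newly ingested records
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")))

client = InsecureClient("http://namenode:50070", user="hdfs")   # assumed namenode URL

for msg in consumer:
    record = msg.value
    # One JSON file per record, partitioned by municipality (layout is illustrative).
    path = "/raw/{}/{}-{}.json".format(
        record.get("municipality", "unknown"), msg.partition, msg.offset)
    with client.write(path, encoding="utf-8") as writer:
        json.dump(record, writer)
```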

7.4 Deployment

Table 12 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.

Table 12 Components needed to deploy the Second SC6 Pilot

Module | Task | Responsible
Spark over HDFS, 4store, Flume, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, SWC
Data storage schema | To be extended for the pilot | SWC
Metadata extraction | Parsers for different data sources will be developed for the pilot | SWC

15 Cf. https://d3js.org


GraphSearch GUI | To be configured for the pilot | SWC


8 Second SC7 Pilot Deployment

8.1 Use cases

The pilot is carried out by SatCen, UoA and NCSR-D in the frame of SC7: Secure societies - protecting freedom and security of Europe and its citizens.

The pilot demonstrates the following workflows:

1. Event detection workflow: news sites and social media are monitored and processed in order to extract and localize information about events. Events are categorized and the relevant information is extracted; the end-user is notified about the area affected by the news and can visualize the event information together with the changes detected by the other workflow (if activated).

2. Change detection workflow: the end user selects a relevant Area of Interest. With respect to the selected dates, two satellite images (earliest and latest) of this area are downloaded from the ESA Sentinels Scientific Data Hub and processed in order to detect changes. The end-user is notified about detected changes and can view the images and event information about this area.

The second cycle of the SC7 pilot will extend the functionality and improve the performance of the first cycle of the pilot (cf. D5.2, Section 8).

Apart from the datasets used in the first cycle of the pilot, this cycle will also use the keyword-based Twitter API to retrieve tweets based on pre-defined keywords (a sketch of such a connector follows). To further support the keyword-based search, the second cycle of the pilot will also include a full-text indexing engine.
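The following is an independent, illustrative sketch of a keyword-based Twitter connector writing to Cassandra (the pilot will instead adapt the existing NOMAD connectors); the credentials, keyspace, table layout and keyword list are assumptions, and the tweepy 3.x and cassandra-driver libraries are used only for illustration.

```python
import tweepy
from cassandra.cluster import Cluster

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

session = Cluster(["cassandra"]).connect("sc7")      # assumed keyspace
insert = session.prepare(
    "INSERT INTO tweets (id, keyword, created_at, username, text, coordinates) "
    "VALUES (?, ?, ?, ?, ?, ?)")

KEYWORDS = ["flood", "earthquake", "wildfire"]        # illustrative pre-defined keywords

for keyword in KEYWORDS:
    # Retrieve recent tweets for the keyword and store text plus provenance metadata.
    for tweet in tweepy.Cursor(api.search, q=keyword, result_type="recent").items(100):
        coords = str(tweet.coordinates) if tweet.coordinates else None
        session.execute(insert, (tweet.id, keyword, tweet.created_at,
                                 tweet.user.screen_name, tweet.text, coords))
```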

The following outputs are made available for visualization or further processing:
- Relevant news related to specific keywords, together with the corresponding Area of Interest
- Detected changes

Moreover, the event detection workflow will be extended in order to automatically activate the change detection workflow. These changes are depicted in the updated architecture diagram in Figure 7.


8.2 Requirements

Table 13 lists the ingestion, storage, processing and output requirements set by the second cycle of the pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 13 lists only the new requirements.

Table 13 Requirements of the Second SC7 Pilot

Requirement | Comment
R1: Monitor keyword-based text services (Twitter). Text is retrieved and stored together with provenance and any metadata provided by the service (notably location). | The NOMAD data connectors to Twitter and Reuters will be adapted to access the keyword search API of Twitter and store to Cassandra.
R2: Regularly execute event detection using Spark over the most recent text batch. | Event detection is part of the ingestion process and adds annotations to the text data; it is not part of the distributed processing.
R3: Improve the speed of the change detection workflow. | Optimize the scalability of the operators developed in Apache Spark for the change detection workflow.
R4: Extend the change detection workflow to improve accuracy. | Fundamental SNAP operators (e.g. Subset and Terrain Correction) for Sentinel-1 will be adapted to Apache Spark.
R5: Areas of Interest are automatically defined by event detection. | The Sentinel data connector is parametrized from the event detection module with a GIS shape.
R6: End-user interface is based on Sextant. | Improvement of Sextant functionalities to improve the user experience.


R7: Users must be authenticated and authorized to access the pilot data. | Sextant will be extended in order to support authentication and authorization.


Figure 7 Architecture of the Second SC7 Pilot

8.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures:
- HDFS for storing satellite images
- Cassandra for storing news and tweet content and metadata
- Lucene for storing the GADM dataset, i.e. the administrative areas together with their geo-locations


- Strabon for storing geo-locations of detected changes and location metadata about news and tweets

Processing infrastructures:
- Spark will be made available for improving the change detection module and developing the event detection module.

Data integration:
- Semagrow will federate Strabon and Cassandra to provide the user interface with homogeneous access to both data stores.

Other modules:
- Twitter data connector
- Reuters RSS feed reader
- The Sentinel Data Aggregator receives as input the set of areas of interest and submits a suitable query to the Sentinels Scientific Data Hub (a sketch of such a query follows after this list).
- Sextant as the user interface
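As a rough illustration of what the Sentinel Data Aggregator submits, the snippet below builds an OpenSearch query over the public Sentinels Scientific Data Hub API from an area of interest expressed as a WKT polygon; the polygon, date range, credentials and exact query grammar are assumptions made for the example.

```python
import requests

# Hypothetical area of interest produced by the event detection module (WKT).
AOI_WKT = "POLYGON((22.8 40.5, 23.1 40.5, 23.1 40.7, 22.8 40.7, 22.8 40.5))"

query = ('footprint:"Intersects({})" AND platformname:Sentinel-1 '
         'AND beginposition:[2016-11-01T00:00:00.000Z TO 2016-11-30T23:59:59.999Z]'
         .format(AOI_WKT))

resp = requests.get("https://scihub.copernicus.eu/dhus/search",
                    params={"q": query, "rows": 2,
                            "orderby": "beginposition desc", "format": "json"},
                    auth=("SCIHUB_USER", "SCIHUB_PASSWORD"))
resp.raise_for_status()

# List candidate products for the selected area and period.
for entry in resp.json()["feed"].get("entry", []):
    print(entry["title"], entry["id"])
```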

8.4 Deployment

Table 14 lists the components provided to the pilot as part of BDI16 and components that will be developed within WP6 in the context of executing the pilot.

Table 14 Components needed to deploy the Second SC7 Pilot

Module | Task | Responsible
Big Data Integrator: HDFS/Hadoop, Cassandra, Spark, Semagrow, Strabon, SOLR | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, UoA, SWC
Cassandra and Strabon stores | The schema needs to be altered to support tweets by keyword | NCSR-D and UoA
Change detection module | Spark code to be developed for extending and improving the change detection algorithm | UoA

16 Cf. https://github.com/big-data-europe/README/wiki/Components


Event Detection module | Spark code to be developed to scale the event detection algorithm | NCSR-D
Twitter data connector | To be extended to access the keyword search Twitter API | NCSR-D
User interface | To be enhanced for the pilot | UoA


9 Conclusions

This report analysed the pilot requirements and specified the components of the generic Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting round. The relevant work in this task is to ensure that the components are within the scope of what is prepared in WP4, and that they interoperate and can be used in the same application.

All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure and provided to the piloting partners as a basis for their piloting applications, which will be developed in WP6. As a result of this preliminary testing and the interaction between the technical partners and the piloting partners, some of the original pilot descriptions have been refined and fully specified, and their usage of BDI components has been clarified. This ensures that the pilot descriptions are consistent with the first public release of the BDI platform (D4.2) and can be reproduced by interested third parties.

Work in this task (Task 5.2) will proceed as follows:
- During the second pilot deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.
- During the third pilot deployment phase, work in this task will prepare the next version of this document regarding the BDI instances needed for the third piloting round.

Page 7: Data Technologies · consulting about the platform. This task includes two phases: the design and the deployment phase. The design phase involves the following. Review the pilot descriptions

D54 ndash v 100

Page

7

71 Use cases 32

72 Requirements 33

73 Architecture 34

74 Deployment 35

8 Second SC7 Pilot Deployment 36

81 Use cases 36

82 Requirements 37

83 Architecture 38

84 Deployment 39

9 Conclusions 41

List of Tables

Table 1 Requirements of the Second SC1 Pilot 11

Table 2 Components needed to Deploy Second SC1 Pilot 13

Table 3 Requirements of the Second SC2 Pilot 16

Table 4 Components needed to deploy the Second SC2 Pilot 19

Table 5 Requirements of the Second SC3 Pilot 21

Table 6 Components needed to deploy the Second SC3 Pilot 23

Table 7 Requirements of the Second SC4 Pilot 25

Table 8 Components needed to deploy the Second SC4 Pilot 28

Table 9 Requirements of the Second SC5 Pilot 29

Table 10 Components needed to deploy the Second SC5 Pilot 31

Table 11 Requirements of the Second SC6 Pilot 33

Table 12 Components needed to deploy the Second SC6 Pilot 36

Table 13 Requirements of the Second SC7 Pilot 38

Table 14 Components needed to deploy the Second SC7 Pilot 40

D54 ndash v 100

Page

8

List of Figures

Figure 1 Architecture of the Second SC1 Pilot 12

Figure 2 Architecture of the Second SC2 Pilot 17

Figure 3 Architecture of the Second SC3 Pilot 22

Figure 4 Architecture of the Second SC4 Pilot 26

Figure 5 Architecture of the Second SC5 Pilot 30

Figure 6 Architecture of the Second SC6 Pilot 34

Figure 7 Architecture of the Second SC7 Pilot 38

D54 ndash v 100

Page

9

1 Introduction

11 Purpose and Scope

This report documents the instantiations of the Big Data Integrator Platform (BDI) for serving

the needs of the domains examined within Big Data Europe These platform instances will be

provided to the relevant networking partners to execute the pilots foreseen in WP6

12 Methodology

Task 52 focuses on the application of the generic Instantiation methodology in a specific Use

Case pertaining to domains closely related to Europersquos Social challenges To this end T52

comprises seven (7) distinct sub-tasks each one dedicated to a different domain of application

Participating partners and their role NCSR-D (task leader) deploys the different instantiations

of the Big Data Integrator Platform and supports the partners carrying out each pilot with

consulting about the platform This task includes two phases the design and the deployment

phase The design phase involves the following

Review the pilot descriptions prepared in WP6 and request clarifications where needed

in order to prepare a detailed technical description of the platform that will support the

pilot

Prepare a first draft of the sections for the second cycle pilots where use cases and

workflow from the pilot descriptions are summarized and technical requirements and

an architecture for each pilot-specific platform is drafted

Cooperate with the persons responsible for each pilot to update the pilot description

and the technical description in this deliverable so that they are consistent and

satisfactory This draft also includes a list of components and their availability (a) base

platform components that are prepared in WP4 (b) pilot-specific components that are

already available or (c) pilot-specific components that will be developed for the pilot

Components are also assigned a partner responsible for their implementation

Review the pilot technical descriptions from the perspective of bridging between

technical work and the community requirements to establish that the pilot is relevant

to the communities it is aimed at

During deployment phase work in this task will follow and document development of the

individual components and test their integration into the platform

D54 ndash v 100

Page

10

2 Second SC1 Pilot Deployment

21 Use Cases

The pilot is carried out by OPF and VU in the frame of SC1 Health Demographic Change and

Wellbeing

The pilot demonstrates the workflow of reproducing the functionality of an existing data

integration and processing system (the Open PHACTS Discovery Platform) on BDI The

second pilot extends the first pilot (cf D52 Section 2) with the following

Discussions with stakeholders and other Societal Challenges will identify how the

existing Open PHACTS platform and datasets may potentially be used to answer

queries in other domains In particular applications in Societal Challenge 2 (food

security and sustainable agriculture) where the effects of chemistry (eg pesticides)

on biology are probed in plants could exploit the linked data services currently within

the OPF platform This will require discussing use case specifics with SC2 to

understand their requirements and ensure that the OPF data is applicable Similarly

we will explore whether SC2 data could be linked to the OPF data platform is relevant

for early biology research

No specific new datasets are targeted for integration in the second pilot However if

datasets to be made available through other pilots have clear potential links to Open

PHACTS datasets these will be considered for integration into the platform to offer

researchers the ability to pose more complex queries across a wider range of data

The second pilot will aim to expand on first pilot by refreshing the datasets integrated

into the pilot Homogenising and integrating the new data available for these datasets

and developing ways to update datasets by integrating new data on an ongoing basis

will enable new use cases where researchers require fully current datasets for their

queries

The second pilot will also simplify existing workflows for querying the API for example

with components for common software tools such as KNIME reducing the barrier for

academic institutions and companies to access the platform for knowledge- and data-

driven biomedical research use cases

22 Requirements

Table 1 lists the ingestion storage processing and output requirements set by this pilot

Table 1 Requirements of the Second SC1 Pilot

D54 ndash v 100

Page

11

Requirement Comment

R1 The solution should be

packaged in a way such that it is

possible to combine the Open

PHACTS Docker and the BDE

platform to achieve a custom

integrated solution

Specificities of the services of the Open PHACTS

Discovery Platform should not be hard-wired into

the domain-specific instance but should be read

from a configuration file (such as SWAGGER)

The BDE instance should offer or apply these

external services over data hosted by the BDE

instance

R2 RDF data storage The current Open PHACTS Discovery Platform is

based on distributed Virtuoso a proprietary

solution The BDE platform will provide a

distributed 4store and SANSA to be compared

with the Open PHACTS Discovery Platform

R3 Datasets are aligned and linked

at data ingestion time and the

transformed data is stored

In conjunction with R1 a modular data ingestion

component should dynamically decide which data

transformers to invoke

R4 Data and query security and

privacy requirements

A BDI local deployment holds private data and

serves private queries BDE does not foresee any

specific technical support for query obfuscation

so remote data sources need to be cloned locally

to guarantee query privacy

Table 1 Requirements of the Second SC1 Pilot

D54 ndash v 100

Page

12

Figure 1 Architecture of the Second SC1 Pilot

Figure 1 Architecture of the Second SC1 pilot

23 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

Distributed triple store for the data The second pilot cycle will also test the feasibility of

using SANSA stack1 as an alternative of SPARQL query processing

Processing infrastructures

Scientific Lenses query expansion

Other modules

Data connector including the data transformation modules for the alignment of data at

ingestion time

REST API for querying that builds a SPARQL query by using keywords to fill in pre-

defined query templates The querying services also uses Scientific Lenses to expand

queries

24 Deployment

Table 2 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

1 httpsansa-stacknet

D54 ndash v 100

Page

13

Table 2 Components needed to Deploy Second SC1 Pilot

Module Task Responsible

4store BDI dockers made available by WP4 NCSR-D

SANSA stack BDI dockers made available by WP4 FhGUniBonn

Data connector and

transformation modules

Develop a dynamic transformation

engine that uses SWAGGER

descriptions to select the appropriate

transformer

VU

Query endpoint Develop a dynamic query re-write

engine that uses SWAGGER

descriptions to select the transformer

VU

Scientific Lenses query

expansion module

Needs to be deployed and tested

unless an existing live service will be

used for the BDE pilot

VU

Table 2 Components needed to Deploy Second SC1 Pilot

D54 ndash v 100

Page

14

3 Second SC2 Pilot Deployment

31 Overview

The pilot is carried out by AK FAO and SWC in the frame of SC2 Food Security Sustainable

Agriculture and Forestry Marine Maritime and Inland Water Research and the Bioeconomy

The second pilot cycle builds upon the first pilot cycle (cf D51 Section 3) expanding the

relevant data sources and extending the data processing needed to handle a variety of data

types (apart from bibliographic data) relevant to Viticulture

The pilot demonstrates the following workflows

1 Text mining workflow Automatically annotating scientific publications by (a) extracting

named entities (locations domain terms) and (b) extracting the captions of images

figures and tables The extracted information is provided to viticultural researchers via

a GUI that exposes search functionality

2 Data processing workflow The end users (viticultural researchers) upload scientific

data in a variety of formats and provide the metadata needed in order to correctly

interpret the data The data is ingested and homogenized so that it can be compared

and connected with other relevant data originally in diverse formats The data is

exposed to viticultural researchers via a GUI that exposes searchdiscovery

aggregation analysis correlation and visualization functionalities over structured data

The results of the data analysis will be stored in the infrastructure to avoid carrying out

the same processing multiple times with appropriate provence for future reference

publication and scientific replication

3 Phenologic modeling workflow that is the scheduling of agricultural operations (eg

pruning harvesting etc) by cross-examining the weather data observed in the area of

the vineyard with the appropriate weather conditions needed for the aforementioned

operations

4 Variety identification workflow The end users complete an on-spot questionnaire

regarding the characteristics of a specific grape variety Together with the geolocation

of the questionnaire this information is used to identify a grape variety

The following datasets will be involved

The AGRIS and PubMed datasets that include scientific publications

Weather data available via publicly-available API such as AccuWeather

OpenWeatherMap Weather Underground

D54 ndash v 100

Page

15

User-generated data such as geotagged photos from leaves young shoots and grape

clusters ampelographic data SSR-marker data that will be provided by the VITIS

application

OIV Descriptor List2 for Grape Varieties and Vitis species

Crop Ontology

The following processing is carried out

Named entity extraction

Researcher affiliation extraction and verification

Variety identification

Phenologic modelling

PDF structure processing to associate tables and diagrams with captions

Data ingestion and homogenization

Aggregation analysis correlation over scientific data

The following outputs are made available for visualization or further processing

Structured information topics extracted from scientific publications

Metadata for dataset searching and discovery

Aggregation analysis correlation results

32 Requirements

Table 3 lists the ingestion storage processing and output requirements set by this pilot

Table 3 Requirements of the Second SC2 Pilot

Requirement Comment

R1 In case of failures during data

processing users should be able to

recover data produced earlier in the

workflow along with their lineage and

associated metadata

Processing modules should periodically

store intermediate results and their lineage

metadata When starting up processing

modules should check at the metadata

registry if intermediate results are available

R2 Extracting images and their captions

from scientific publications

To be developed for the pilot taking into

account R1

2 httpwwwoivinten

D54 ndash v 100

Page

16

R3 Extracting thematic annotations from

text in scientific publications

To be developed for the pilot taking into

account R1

R4 Extracting researcher affiliations from

the scientific publications

To be developed for the pilot taking into

account R1

R5 Variety identification To be developed for the pilot taking into

account R1

R6 Phenolic modeling To be developed for the pilot taking into

account R1

R5 Expose data and metadata in JSON

through a Web API

Data ingestion module should write JSON

documents in HDFS 4store should be

accessed via a SPARQL endpoint that

responds with results in JSON

Table 3 Requirements of the Second SC2 Pilot

D54 ndash v 100

Page

17

Figure 2 Architecture of the Second SC2 Pilot

Figure 2 Architecture of the Second SC2 Pilot

33 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing publication full-text and ingested datasets

A graph database for storing publication metadata (terms and named entities)

affiliation metadata (connections between researchers) weather metadata and VITIS

metadata

Processing infrastructures

Metadata extraction Spark or UnifiedViews3 are used to extract RDF metadata from

publication full-text These tools will react on Kafka messages Spark and UnifiedViews

will be evaluated for this task

3 Cf httpwwwunifiedviewseu

D54 ndash v 100

Page

18

PoolParty A SKOS Thesaurus4 will be used to consolidatetranslate (linkmap) the

terms in the ingested documents (eg bio terms locations and other named entities)

For this step the SWC PoolParty Semantic Suite5 will be used Additional enrichment

of the dataset will be explored eg via linking to DBpedia or other LOD sources

AKSTEM the process of discovering relations and associations between organizations

and people in the field of viticulture research

Phenolic Modeling algorithm already developed in AK VITIS will be adapted to work in

the context of an Apache Spark application

Variety Identification already developed in AK VITIS will be adapted to work in the

context of an Apache Spark application

Extraction of images and figures and their captions from publication PDFs

Data analysis which writes analysis results back into the infrastructure to be retrieved

for visualization Data analysis should accompany each write-back with appropriate

metadata that specify the processing lineage of the derived dataset Intermediate

results should also be written out (and described as such in the metadata) in order to

allow resuming processing after a failure

Other modules

Flume for publication ingestion For every source that will be ingested into the system

there will be a flume agent responsible for data ingestion and basic

modificationunification

Kafka as soon as a new record is available a Kafka message will be produced One

kafka consumer stores raw data into HDFS

34 Deployment

Table 4 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 4 Components needed to deploy the Second SC2 Pilot

Module Task Responsible

Spark over HDFS Flume

Kafka

BDI dockers made available by WP4 FH TF InfAI

SWC

4 Cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 5 Cf httpwwwpoolpartybiz

D54 ndash v 100

Page

19

GraphDB andor Neo4j

dockerization

To be investigated if the Docker

images provided by the official

systems6 are suitable for the pilot If

not will be altered for the pilot or use

an already dockerized triple store such

as Virtuoso or 4store

SWC

Flume agents for publication

ingestion and processing

To be developed for the pilot SWC

Flume agents for data

ingestion

To be extended for the pilot in order to

support the introduced datasets

(accuweather data user-generated

data)

SWC AK

Data storage schema To be developed for the pilot SWC AK

Phenolic modelling To be adapted from AK VITIS for the

pilot

AK

Spark AKSTEM To be adapted from AK STEM for the

pilot

AK

Variety Identification To be adapted from AK VITIS for the

pilot

AK

Table 4 Components needed to deploy the Second SC2 Pilot

6 httpsneo4jcomdeveloperdocker

D54 ndash v 100

Page

20

4 Second SC3 Pilot Deployment

41 Overview

The pilot is carried out by CRES in the frame of SC3 Secure Clean and Efficient Energy

The second pilot cycle extends the first pilot by adding additional online and offline data

analysis on raw data regarding Acoustic Emissions (AE) sensors and aggregated data such

as parametrics from continuous monitoring systems (CMS) The pilot demonstrates the

following workflow a developer in the field of wind energy enhances condition monitoring for

each unit in a wind farm by pooling together data from multiple units from the same farm (to

consider the cluster operation in total) and third party data (to perform correlated assessment)

The custom analysis modules created by the developer use both raw data that are transferred

offline to the processing cluster and condensed data streamed online at the same time order

that the event occurs

The following datasets are involved

Raw sensor and SCADA data from a given wind farm

Online stream data comprised of parametrics and statistics extracted from the raw

SCADA data

Raw sensor data from Acoustic Emissions module from a given wind farm

All data is in custom binary or ASCII formats ASCII files contain a metadata header and in

tabulated form the signal data (signal in columns time sequence in rows) All data is annotated

by location time and system id

The following processing is carried out

Near-real time execution of parametrized models to return operational statistics

warnings including correlation analysis of data across units

Weekly execution of operational statistics

Weekly execution of model parametrization

Weekly specific acoustic emissions DSP

The following outputs are made available for visualization or further processing

Operational statistics near-real time and weekly

Model parameters

D54 ndash v 100

Page

21

42 Requirements

Table 5 lists the ingestion storage processing and output requirements set by this pilot Since

the second cycle of the pilot extends the first pilot some requirements are identical and

therefore omitted from Table 5

Table 5 Requirements of Second SC3 Pilot

Requirement Comment

R1 The online data will be sent (via

OPC) from the intermediate

(local) processing level to BDI

A data connector must be developed that provides

for receiving OPC streams from an OPC-

compatible server

R2 The application should be able

to recover from short outages by

collecting the data transmitted

during the outage from the data

sources

An OPC data connector must be developed that

can retrieve the missing data collected at the

intermediate level from the distributed data

historian systems

R3 Near-realtime execution of

parametrized models to return

operational statistics including

correlation analysis of data

across units

The analysis software should write its results back

into a specified format and data model that is

appropriate input for further analysis

R4 The GUI supports database

querying and data visualization

for the analytics results

The GUI will be able to access files in the format

and data model

Table 5 Requirements of the Second SC3 Pilot

D54 ndash v 100

Page

22

Figure 3 Architecture of the Second SC3 Pilot

Figure 3 Architecture of the Second SC3 Pilot

43 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS that stores binary blobs each holding a temporal slice of the complete data The

slicing parameters are fixed and can be applied at data ingestion time

A Postgres relational database to store the warnings operational statistics and the

output of the analysis The schema will be defined in a later

A Kafka broker that will distribute the continuous stream of CMS to model execution

Processing infrastructures

D54 ndash v 100

Page

23

A processor that operates upon temporal slices of data

A Spark module that orchestrates the application of the processor on slices

A Spark streaming module that operates on the online data

Other modules

A data connector that offers an ingestion endpoint andor can retrieve from remote data

sources using the FTP protocol

A data connector that offers an ingestion endpoint that can retrieve an online stream

using OPC protocol and publish it to a Kafka topic

Data visualization that can visualize the data files stored in HDFS

44 Deployment

Table 6 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 6 Components needed to deploy the Second SC3 Pilot

Module Task Responsible

Spark HDFS Postgres

Kafka

BDI dockers made available by WP4 FH TF InfAI

NCSR-D SWC

Acoustic Emissions DSP To be developed for the pilot CRES

OPC Data connector To be developed for the pilot CRES

Data visualization To be extended for the pilot CRES

Table 6 Components needed to deploy the Second SC3 Pilot

D54 ndash v 100

Page

24

5 Second SC4 Pilot Deployment

51 Use cases

The pilot is carried out by FhG and CERTH in the frame of SC4 Smart Green and Integrated

Transport

The pilot demonstrates how to implement the workflow for ingesting processing and storing

stream and historical traffic data in a distributed environment The pilot demonstrates the

following workflows

The map matching of the Floating Car Data (FCD) stream that is generated by the taxi

fleet The FCD data that represents the position of cabs using latitude and longitude

coordinates must be map matched to the roads on which the cabs are driving in order

to infer the traffic conditions of the roads The map matching is done through an

algorithm using a geographical database and topological rules

The monitoring of the current traffic conditions that consumes the mapped FCD data

and infers the traffic conditions of the roads

The forecasting of future traffic conditions based on a model that is trained from

historical and real-time mapped FCD data

The second pilot is based upon the processing modules developed in the first pilot (cf D52

Section 5) namely the processing modules developed by CERTH to analyze traffic data

classify traffic conditions The second pilot will also develop the newly added workflow of the

traffic forecasting and model training that did not exist during the first pilot cycle

The data sources available for the pilot are

A near-real time stream Floating Car Data (FCD) generated by a fleet of 1200 taxis

containing information about the position speed and direction of the cabs

A historical database of recorded FCD data

A geographical database with information about the road network in Thessaloniki

The results of traffic monitoring and traffic forecasting are saved into a database for querying

statistics and visualizations

52 Requirements

Table 7 lists the ingestion storage processing and output requirements set by this pilot Since

the present pilot cycle is an extension of the first pilot the requirements of the first pilot also

apply Table 13 lists only the new requirements

D54 ndash v 100

Page

25

Table 7 Requirements of the Second SC4 Pilot

Requirement Comment

R1 The pilot will enable the

evaluation of the present and

future traffic conditions (eg

congestion) within temporal

windows

The FCD map matched data are used to determine

the current traffic condition and to make predictions

within different time windows

R2 The traffic predictions will be

saved in a database

Traffic condition and prediction will be used for

queries statistics evaluation of the quality of

predictions visualizations

R3 The pilot can be started in two

configurations single node (for

development and testing) and

cluster (production)

It must be possible to run all the pilot components

in one single node for development and testing

purposes The cluster configuration must provide

cluster of any components messaging system

(Kafka) processing modules (Flink Spark

TensorFlow) storage (Postgres)

Table 7 Requirements of the Second SC4 Pilot

D54 ndash v 100

Page

26

Figure 4 Architecture of the Second SC4 Pilot

Figure 4 Architecture of the Second SC4 Pilot

53 Architecture

The architecture of the pilot has been designed taking into consideration the data sources

mostly streams the processing steps needed and the information that needs to be computed

The pilot will ingest data from a near real-time FCD data stream from cabs and from historical

FCD data The FCD data needs to be preprocessed for map matching before being used for

classificationprediction

Apache Kafka will be used to distribute the computations as it provides a scalable fault

tolerant messaging system The processing of the data streams will be performed within

temporal windows Apache Flink will be used for the map matching algorithm in the same

manner as in the first cycle of the pilot Apache Spark or Tensorflow will be considered as a

platform to implement the traffic forecasting algorithm

The algorithms used for the map matching and classification will be provided using R as

it provides a good support for machine learning algorithms and because it is commonly used

D54 ndash v 100

Page

27

and well known by researchers at CERTH In order to use the R packages in a Flink application

developed in Java the pilot will connect to R server (via Rserve) Recurrent Neural Networks

will be used for the traffic forecasting module

The traffic conditions and prediction computation will be stored in a scalable fault tolerant

database such as Elasticsearch The storage system must support spatial and temporal

indexing

54 Deployment

Table 8 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 8 Components needed to deploy Second SC4 Pilot

Module Task Responsible

PostGIS Elasticsearch

Kafka Flink Spark

TensorFlow

BDI dockers made available by WP4 NCSR-D SWC

TF FhG

A Kafka producer for FCD

data stream (source URL)

and historical data (source

file system)

Develop a Kafka producer to collect

the FCD data as a stream from web

services and from the file system for

the historical data sets and send them

to a Kafka topic

FhG

Kafka brokers Install Kafka to provide a message

broker and the topics

SWC

A Spark application for traffic

forecasting and model

training

Develop a Spark application that

consumes FCD matched data from a

Kafka topic The application will train a

prediction model and write the traffic

predictions to ElasticSearch

FhG

D54 ndash v 100

Page

28

A Kafka consumer for storing

analysis results

Develop a Kafka consumer that stores

the result of the Traffic Classification

and prediction module

FhG

Table 8 Components needed to deploy the Second SC4 Pilot

6 Second SC5 Pilot Deployment

61 Use cases

The pilot is carried out by NCSR-D in the frame of SC5 Climate Action Environment Resource

Efficiency and Raw Materials

The pilot demonstrates the following workflow A (potentially hazardous) substance is released

in the atmosphere that results to increased readings in one or more monitoring stations The

user accesses a user interface provided by the pilot to define the locations of the monitoring

stations as well as a timeseries of the measured values (eg gamma dose rate) The platform

initiates

a weather matching algorithm that is a search for similarity of the current weather and

the pre-computed weather patterns as well as

a dispersion matching algorithm that is a search for similarity of the current substance

dispersion patterns with the precomputed ones

The weather patterns have been extracted in a pre-processing step by clustering weather

conditions recorded in the past while the substance dispersion patterns have been

precomputed by simulating different scenarios of substance release and weather conditions

The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request

The following datasets are involved

NetCDF files from the European Centre for Medium range Weather Forecasting

(ECMWF7)

GRIB files from National Oceanic and Atmospheric Administration (NOAA8)

The following processing will be carried out

The weather clustering algorithm that creates clusters of similar weather conditions

implemented using the BDI platform (see Section 63)

7 httpappsecmwfintdatasets 8 httpswwwncdcnoaagovdata-accessmodel-datamodel-datasetsglobal-forcast-system-gfs

D54 ndash v 100

Page

29

The WRF downscaling that takes as input a low resolution weather and creates a high

resolution weather

The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model

computes dispersion patterns given predominant weather conditions

The following outputs are made available for visualization or further processing

The dispersions produced by DIPCOT

The weather clusters produced by the weather clustering algorithm

62 Requirements

Table 9 lists the ingestion storage processing and output requirements set by this pilot

Table 9 Requirements of Second SC5 Pilot

Requirement Comment

R1 Provide a means of downloading

currentevaluation weather from

ECMWF or alternative services

Data connectorinterface needs to be developed

R2 ECMWF and NOAA datasets are

compatible with the WRF and

DIPCOT naming conventions

A preprocessing WPS normalization step will

perform the necessary transformations and

variable renamings needs to ensure compatibility

R3 Retrieve NetCDF files from HDFS

as input to the weather clustering

algorithm

R4 Dispersion matching will filter on

dispersion values

Relational database will provide indexes on

dispersion values for efficient dispersion search

R5 Dispersion visualization Weather and dispersion matching must produce

output compatible with Sextantrsquos input or Sextant

must be modified to support new input

Table 9 Requirements of the Second SC5 Pilot

D54 ndash v 100

Page

30

Figure 5 Architecture of the Second SC5 Pilot

Figure 5 Architecture of the Second SC5 Pilot

63 Architecture

To satisfy the requirements described above the following components will be deployed

Storage infrastructure

HDFS for storing NetCDF and GRIB files

Postgres for storing dispersions

Processing components

Scilearn-kit or TensorFlow to host the weather clustering algorithm

Other modules

ECMWF and NOAA data connectors

WPS normalization procedure

WRF downscaling component

DIPCOT atmospheric dispersion model

Weather and dispersion matching

Sextant for visualizing the dispersion layer

D54 ndash v 100

Page

31

64 Deployment

Table 10 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 10 Components needed to deploy the Second SC5 Pilot

Module Task Responsible

HDFS Sextant Postgres BDI dockers made available by WP4 TF UoA NCSR-D

Scikit-learn TensorFlow To be developed in the pilot NCSR-D

DIPCOT To be packaged in the pilot NCSR-D

Weather clustering algorithm To be developed in the pilot NCSR-D

Weather matching To be developed in the pilot NCSR-D

Dispersion matching To be developed in the pilot NCSR-D

ECMWF and NOAA data

connector

To be developed in the pilot NCSR-D

Data visualization UI To be developed in the pilot NCSR-D

Table 10 Components needed to deploy the Second SC5 Pilot

D54 ndash v 100

Page

32

7 Second SC6 Pilot Deployment

71 Use cases

The pilot is carried out by NCSR-D and SWC in the frame of SC6 Europe in a changing world

- inclusive innovative and reflective societies

The pilot demonstrates the following workflow Municipality economic data (ie budget and

budget execution data) are ingested at a regular basis (daily weekly and so on) from a series

of locations in a variety of structures and formats are homogenized so that they can be

compared analyzed and visualized in a comprehensible way The data is exposed to users

via a dashboard that exposes searchdiscovery aggregation analysis correlation and

visualization functionalities over structured data The results of the data analysis will be stored

in the infrastructure to avoid carrying out the same processing multiple times

The second cycle of the pilot will extend the first pilot by incorporating different formats by

developing a modular parsing library

The following datasets are involved

Budget execution data of Municipality of Athens

Budget execution data of Municipality of Thessaloniki

Budget execution data of Municipality of Barcelona

The current datasets involved are exposed either as an API or as CSV XML files

Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies

Statistical data will be described in the RDF DataCube12 vocabulary

The following processing is carried out

Data ingestion and homogenization

Aggregation analysis correlation over scientific data

The following outputs are made available for visualization or further processing

Structured information extracted from budget datasets exposed as a SPARQL endpoint

Metadata for dataset searching and discovery

9 Cf httpsjoinupeceuropaeuassetdcat_application_profiledescription 10 Cf httpwwwomgorgspecEDMC-FIBOFND10Beta1indexhtm 11 Cf httpwwwomgorghot-topicsfinancehtm 12 Cf httpswwww3orgTR2014REC-vocab-data-cube-20140116

D54 ndash v 100

Page

33

Aggregation and analysis

72 Requirements

Table 11 lists the ingestion storage processing and output requirements set by this pilot

Table 11 Requirements of the Second SC6 Pilot

Requirement Comment

R1 In case of failures during data

processing users should be able to

recover data produced earlier in the

workflow along with their lineage and

associated metadata

Processing modules should periodically

store intermediate results When starting

up processing modules should check at the

metadata registry if intermediate results are

available

R2 Transform budget data into a

homogenized format using various

parsers

Parsers will be developed for the pilot

taking into account R1

R3 Expose data and metadata through a

SPARQL endpoint

The triple store should be accessed via a

SPARQL endpoint

R4 Intuitive easy-to-use interface for

searching and selecting relevant data

sources The use of the user interface

should be documented so that users

can ease into using it with as little

effort as possible

The GraphSearch UI will be used to create

visualizations from SPARQL queries

Table 11 Requirements of the Second SC6 Pilot

D54 ndash v 100

Page

34

Figure 6 Architecture of the Second SC6 Pilot

Figure 6 Architecture of the Second SC6 Pilot

73 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing ingested datasets

4store for storing homogenized statistical data and dataset metadata

Processing infrastructures

Metadata extraction Spark is used to extract RDF data and metadata from budget

data These tools will react on Kafka messages

PoolParty A SKOS Thesaurus13 will be used to consolidatetranslate (linkmap) the

terms in the ingested documents (eg bio terms locations and other named entities)

For this step the SWC PoolParty Semantic Suite14 will be used as an external service

13 Please cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 14 Please cf httpwwwpoolpartybiz

D54 ndash v 100

Page

35

PoolParty is accessible from the BDE components via an HTTP API The connection

between Spark and PoolParty has been implemented in the first pilot cycle Additional

enrichment of the dataset will be explored eg via linking to DBpedia or other LOD

sources

Data analysis that will be performed on demand by pre-defined queries in the

dashboard

Other modules

Flume for dataset ingestion For every source that will be ingested into the system there

will be a flume agent responsible for data ingestion and basic modificationunification

Kafka as soon as a new record is available a Kafka message will be produced One

kafka consumer stores raw data into HDFS

A set of pre-defined SPARQL queries that carry out analytical aggregations important

comparisons and or other analysis of the data

GUI that provide functionality for (a) metadata searching to discover datasets data and

publications (b) linked data browsing (ie dereferencing entity descriptions in RDF) in

the form of a visual dashboard realised in d3js15

GraphSearch as the user interface

74 Deployment

Table 12 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 12 Components needed to deploy the Second SC6 Pilot

Module Task Responsible

Spark over HDFS 4store

Flume Kafka

BDI dockers made available by WP4 FH TF InfAI

NCSR-D SWC

Data storage schema To be extended for the pilot SWC

Metadata extraction Parsers for different data sources will

be developed for the pilot

SWC

15 Cf httpsd3jsorg

D54 ndash v 100

Page

36

GraphSearch GUI To be configured for the pilot SWC

Table 12 Components needed to deploy the Second SC6 Pilot

8 Second SC7 Pilot Deployment

81 Use cases

The pilot is carried out by SatCen UoA and NCSR-D in the frame of SC7 Secure societies ndash

Protecting freedom and security of Europe and its citizens

The pilot demonstrates the following workflows

1 Event detection workflow News sites and social media are monitored and processed

in order to extract and localize information about events Events are categorized and

the information from them is extracted the end-user is notified about the area interested

by the news and can visualize the events information together with the changes

detected by the other workflow (if activated)

2 Change detection workflow The end user selects a relevant Area of Interest With

respect to the selected dates two satellite images (earliest and latest) of these areas

are downloaded from ESA Sentinels Scientific Data Hub and processed in order to

detect changes The end-user is notified about detected changes and can view the

images and event information about this area

The second cycle of the SC7 pilot will extend the functionality and improve the performance of

the first cycle of the pilot (cf D52 Section 8)

Apart from the datasets used in the first cycle of the pilot this cycle will also use the keyword-

based Twitter API to retrieve tweets based on pre-defined keywords To further support the

keyword-based search the second cycle of the pilot will also include a full-text indexing engine

The following outputs are made available for visualization or further processing

Relevant news related to specific keywords together with the corresponding Area of

Interested

Detected changes

Moreover the event detection workflow will be extended in order to automatically activate the

change detection workflow These changes are depicted in the updated architecture diagram

in Figure 7

D54 ndash v 100

Page

37

82 Requirements

Table 13 lists the ingestion storage processing and output requirements set by the second

cycle of the pilot Since the present pilot cycle is an extension of the first pilot the requirements

of the first pilot also apply Table 13 lists only the new requirements

Table 13 Requirements of the Second SC7 Pilot

Requirement Comment

R1 Monitor keyword-based text services

(Twitter) Text is retrieved and stored

together with provenance and any

metadata provided by the service

(notably location)

The NOMAD data connectors to Twitter

and Reuters will be adapted to access the

keyword search API of Twitter and store to

Cassandra

R2 Regularly execute event detection

using Spark over the most recent text

batch

Event detection is part of the ingestion

process and adds annotations to the text

data not part of the distributed processing

R3 Improve the speed of the change

detection workflow

Optimize the scalability of the operators

developed in Apache Spark for the change

detection workflow

R4 Extend change detection workflow to

improve accuracy

Fundamental SNAP operators (eg Subset

and Terrain Correction) for Sentinel 1 will be

adapted to Apache Spark

R5 Areas of Interest are automatically

defined by event detection

The Sentinel data connector is

parametrized from the event detection

module with a GIS shape

R6 End-user interface is based on Sextant Improvement of Sextant functionalities to

improve the user experience

D54 ndash v 100

Page

38

R7 Users must be authenticated and

authorized to access the pilot data

Sextant will be extended in order to support

authentication and authorization

Table 13 Requirements of the Second SC7 Pilot

Figure 7 Architecture of the Second SC7 Pilot

Figure 7 Architecture of the Second SC7 Pilot

83 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing satellite images

Cassandra for storing news and tweets content and metadata

Lucene for storing GADM dataset ie the administrative areas together with their geo-

locations

D54 ndash v 100

Page

39

Strabon for storing geo-locations of detected changes and location metadata about

news and tweets

Processing infrastructures

Spark will be made available for improving the change detection module and

developing the event detection module

Data integration

Semagrow will federate Strabon and Cassandra to provide the user interface with

homogeneous access to both data stores

Other modules

Twitter data connector

Reuters RSS feed reader

The Sentinel Data Aggregator receives as input the set of areas of interest and submits

a suitable query to the Sentinels Scientific Data Hub

Sextant as the user interface

84 Deployment

Table 14 lists the components provided to the pilot as part of BDI16 and components that will

be developed within WP6 in the context of executing the pilot

Table 14 Components needed to deploy the Second SC7 Pilot

Module Task Responsible

Big Data Integrator

HDFSHadoop Cassandra

Spark Semagrow Strabon

SOLR

BDI dockers made available by WP4 FH TF InfAI

NCSR-D UoA

SwC

Cassandra and Strabon

stores

The schema needs to be altered to

support tweets by keyword

NCSR-D and

UoA

Change detection module Spark code to be developed for UoA

16 Cf httpsgithubcombig-data-europeREADMEwikiComponents

D54 ndash v 100

Page

40

extending and improving the change

detection algorithm

Event Detection module Spark code to be developed to scale

the event detection algorithm

NCSR-D

Twitter data connector To be extended to access the keyword

search Twitter API

NCSR-D

User interface To be enhanced for the pilot UoA

Table 14 Components needed to deploy the Second SC7 Pilot

D54 ndash v 100

Page

41

9 Conclusions This report analysed the pilot requirements and specifies the components of the the generic

Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting

round The relevant work in this task is to ensure that the components are within the scope

of what is prepared in WP4 and that they interoperate and can be used in the same

application

All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure

and provided to the piloting partners as a basis for their piloting applications which will be

developed in WP6 As a result of this preliminary testing and the interaction between the

technical partners and the piloting partners some of the original pilot descriptions have

been refined and fully specified and their usage of BDI components has been clarified This

ensures that the pilot descriptions are consistent with the first public release of the BDI

platform (D42) and can be reproduced by interested third parties

Work in this task (Task 52) will proceed as follows

During the second pilot deployment phase work in this task will follow and document

development of the individual components and test their integration into the platform

During the third pilot deployment phase work in this task will prepare the next version

of this document regarding the BDI instances needed for the third piloting round

Page 8: Data Technologies · consulting about the platform. This task includes two phases: the design and the deployment phase. The design phase involves the following. Review the pilot descriptions

D54 ndash v 100

Page

8

List of Figures

Figure 1 Architecture of the Second SC1 Pilot 12

Figure 2 Architecture of the Second SC2 Pilot 17

Figure 3 Architecture of the Second SC3 Pilot 22

Figure 4 Architecture of the Second SC4 Pilot 26

Figure 5 Architecture of the Second SC5 Pilot 30

Figure 6 Architecture of the Second SC6 Pilot 34

Figure 7 Architecture of the Second SC7 Pilot 38

D54 ndash v 100

Page

9

1 Introduction

1.1 Purpose and Scope

This report documents the instantiations of the Big Data Integrator Platform (BDI) for serving the needs of the domains examined within Big Data Europe. These platform instances will be provided to the relevant networking partners to execute the pilots foreseen in WP6.

1.2 Methodology

Task 5.2 focuses on the application of the generic instantiation methodology in specific use cases pertaining to domains closely related to Europe's Societal Challenges. To this end, T5.2 comprises seven (7) distinct sub-tasks, each one dedicated to a different domain of application.

Participating partners and their role: NCSR-D (task leader) deploys the different instantiations of the Big Data Integrator Platform and supports the partners carrying out each pilot with consulting about the platform. This task includes two phases: the design and the deployment phase. The design phase involves the following:

- Review the pilot descriptions prepared in WP6 and request clarifications where needed, in order to prepare a detailed technical description of the platform that will support the pilot.
- Prepare a first draft of the sections for the second-cycle pilots, where use cases and workflows from the pilot descriptions are summarized, and technical requirements and an architecture for each pilot-specific platform are drafted.
- Cooperate with the persons responsible for each pilot to update the pilot description and the technical description in this deliverable, so that they are consistent and satisfactory. This draft also includes a list of components and their availability: (a) base platform components that are prepared in WP4, (b) pilot-specific components that are already available, or (c) pilot-specific components that will be developed for the pilot. Components are also assigned a partner responsible for their implementation.
- Review the pilot technical descriptions from the perspective of bridging between technical work and the community requirements, to establish that the pilot is relevant to the communities it is aimed at.

During the deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.

2 Second SC1 Pilot Deployment

2.1 Use Cases

The pilot is carried out by OPF and VU in the frame of SC1: Health, Demographic Change and Wellbeing.

The pilot demonstrates the workflow of reproducing the functionality of an existing data integration and processing system (the Open PHACTS Discovery Platform) on BDI. The second pilot extends the first pilot (cf. D5.2, Section 2) with the following:

- Discussions with stakeholders and other Societal Challenges will identify how the existing Open PHACTS platform and datasets may potentially be used to answer queries in other domains. In particular, applications in Societal Challenge 2 (food security and sustainable agriculture), where the effects of chemistry (e.g. pesticides) on biology are probed in plants, could exploit the linked data services currently within the OPF platform. This will require discussing use case specifics with SC2 to understand their requirements and ensure that the OPF data is applicable. Similarly, we will explore whether SC2 data that is relevant for early biology research could be linked to the OPF data platform.
- No specific new datasets are targeted for integration in the second pilot. However, if datasets to be made available through other pilots have clear potential links to Open PHACTS datasets, these will be considered for integration into the platform, to offer researchers the ability to pose more complex queries across a wider range of data.
- The second pilot will aim to expand on the first pilot by refreshing the datasets integrated into the pilot. Homogenising and integrating the new data available for these datasets, and developing ways to update datasets by integrating new data on an ongoing basis, will enable new use cases where researchers require fully current datasets for their queries.
- The second pilot will also simplify existing workflows for querying the API, for example with components for common software tools such as KNIME, reducing the barrier for academic institutions and companies to access the platform for knowledge- and data-driven biomedical research use cases.

2.2 Requirements

Table 1 lists the ingestion, storage, processing and output requirements set by this pilot.

Table 1: Requirements of the Second SC1 Pilot

R1. The solution should be packaged in a way such that it is possible to combine the Open PHACTS Docker and the BDE platform to achieve a custom integrated solution.
    Comment: Specificities of the services of the Open PHACTS Discovery Platform should not be hard-wired into the domain-specific instance, but should be read from a configuration file (such as SWAGGER). The BDE instance should offer or apply these external services over data hosted by the BDE instance.

R2. RDF data storage.
    Comment: The current Open PHACTS Discovery Platform is based on distributed Virtuoso, a proprietary solution. The BDE platform will provide a distributed 4store and SANSA, to be compared with the Open PHACTS Discovery Platform.

R3. Datasets are aligned and linked at data ingestion time, and the transformed data is stored.
    Comment: In conjunction with R1, a modular data ingestion component should dynamically decide which data transformers to invoke.

R4. Data and query security and privacy requirements.
    Comment: A BDI local deployment holds private data and serves private queries. BDE does not foresee any specific technical support for query obfuscation, so remote data sources need to be cloned locally to guarantee query privacy.

Figure 1: Architecture of the Second SC1 Pilot

2.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures:
- Distributed triple store for the data. The second pilot cycle will also test the feasibility of using the SANSA stack1 as an alternative for SPARQL query processing.

Processing infrastructures:
- Scientific Lenses query expansion.

Other modules:
- Data connector, including the data transformation modules for the alignment of data at ingestion time.
- REST API for querying that builds a SPARQL query by using keywords to fill in pre-defined query templates (a minimal sketch of this idea is given after this list). The querying service also uses Scientific Lenses to expand queries.
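The following is a minimal sketch, not the pilot's implementation, of how a keyword can be slotted into a pre-defined query template and sent to a SPARQL endpoint. The endpoint URL, the template and the property names are illustrative assumptions.

    # Minimal sketch of filling a pre-defined SPARQL template with a keyword
    # and querying a SPARQL endpoint. URL and template are placeholders.
    import requests

    SPARQL_ENDPOINT = "http://localhost:8890/sparql"   # assumed endpoint

    TEMPLATE = """
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?compound ?label WHERE {{
      ?compound rdfs:label ?label .
      FILTER(CONTAINS(LCASE(STR(?label)), "{keyword}"))
    }} LIMIT 100
    """

    def query_by_keyword(keyword):
        # Fill the template with the user-supplied keyword and run the query
        query = TEMPLATE.format(keyword=keyword.lower())
        response = requests.get(
            SPARQL_ENDPOINT,
            params={"query": query},
            headers={"Accept": "application/sparql-results+json"},
            timeout=30,
        )
        response.raise_for_status()
        return response.json()["results"]["bindings"]

    if __name__ == "__main__":
        for row in query_by_keyword("aspirin"):
            print(row["compound"]["value"], row["label"]["value"])

In the pilot, the template itself would come from the configuration file foreseen in R1 rather than being hard-coded.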

2.4 Deployment

Table 2 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.

1 Cf. http://sansa-stack.net/

Table 2: Components needed to deploy the Second SC1 Pilot

Module | Task | Responsible
4store | BDI dockers made available by WP4 | NCSR-D
SANSA stack | BDI dockers made available by WP4 | FhG/UniBonn
Data connector and transformation modules | Develop a dynamic transformation engine that uses SWAGGER descriptions to select the appropriate transformer | VU
Query endpoint | Develop a dynamic query re-write engine that uses SWAGGER descriptions to select the transformer | VU
Scientific Lenses query expansion module | Needs to be deployed and tested, unless an existing live service will be used for the BDE pilot | VU

3 Second SC2 Pilot Deployment

3.1 Overview

The pilot is carried out by AK, FAO and SWC in the frame of SC2: Food Security, Sustainable Agriculture and Forestry, Marine, Maritime and Inland Water Research, and the Bioeconomy.

The second pilot cycle builds upon the first pilot cycle (cf. D5.1, Section 3), expanding the relevant data sources and extending the data processing needed to handle a variety of data types (apart from bibliographic data) relevant to viticulture.

The pilot demonstrates the following workflows:

1. Text mining workflow: automatically annotating scientific publications by (a) extracting named entities (locations, domain terms) and (b) extracting the captions of images, figures and tables. The extracted information is provided to viticultural researchers via a GUI that exposes search functionality.
2. Data processing workflow: the end users (viticultural researchers) upload scientific data in a variety of formats and provide the metadata needed in order to correctly interpret the data. The data is ingested and homogenized so that it can be compared and connected with other relevant data originally in diverse formats. The data is exposed to viticultural researchers via a GUI that exposes search/discovery, aggregation, analysis, correlation and visualization functionalities over structured data. The results of the data analysis will be stored in the infrastructure, to avoid carrying out the same processing multiple times, with appropriate provenance for future reference, publication and scientific replication.
3. Phenologic modeling workflow, that is, the scheduling of agricultural operations (e.g. pruning, harvesting, etc.) by cross-examining the weather data observed in the area of the vineyard with the appropriate weather conditions needed for the aforementioned operations (a minimal scheduling sketch is given after this list).
4. Variety identification workflow: the end users complete an on-spot questionnaire regarding the characteristics of a specific grape variety. Together with the geolocation of the questionnaire, this information is used to identify a grape variety.
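As a very small illustration of the scheduling idea in workflow 3, the sketch below compares a daily forecast against the conditions an operation needs. The thresholds and field names are illustrative assumptions, not the logic of AK VITIS.

    # Minimal sketch of phenologic scheduling: find days whose weather fits
    # the conditions required by an agricultural operation. Thresholds and
    # field names are assumptions for illustration only.
    OPERATION_CONDITIONS = {
        # operation: (min temperature, max temperature, max rain probability)
        "pruning":    (5.0, 20.0, 0.30),
        "harvesting": (12.0, 30.0, 0.10),
    }

    def suitable_days(operation, daily_forecast):
        """daily_forecast: list of dicts with 'date', 'temp_c', 'rain_prob'."""
        t_min, t_max, rain_max = OPERATION_CONDITIONS[operation]
        return [day["date"] for day in daily_forecast
                if t_min <= day["temp_c"] <= t_max and day["rain_prob"] <= rain_max]

    forecast = [
        {"date": "2017-03-01", "temp_c": 9.0, "rain_prob": 0.1},
        {"date": "2017-03-02", "temp_c": 3.5, "rain_prob": 0.6},
    ]
    print(suitable_days("pruning", forecast))   # ['2017-03-01']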

The following datasets will be involved:

- The AGRIS and PubMed datasets, which include scientific publications.
- Weather data available via publicly-available APIs, such as AccuWeather, OpenWeatherMap and Weather Underground.
- User-generated data, such as geotagged photos of leaves, young shoots and grape clusters, ampelographic data, and SSR-marker data that will be provided by the VITIS application.
- The OIV Descriptor List2 for Grape Varieties and Vitis species.
- The Crop Ontology.

2 http://www.oiv.int/en/

The following processing is carried out:

- Named entity extraction.
- Researcher affiliation extraction and verification.
- Variety identification.
- Phenologic modelling.
- PDF structure processing, to associate tables and diagrams with captions.
- Data ingestion and homogenization.
- Aggregation, analysis and correlation over scientific data.

The following outputs are made available for visualization or further processing:

- Structured information and topics extracted from scientific publications.
- Metadata for dataset searching and discovery.
- Aggregation, analysis and correlation results.

3.2 Requirements

Table 3 lists the ingestion, storage, processing and output requirements set by this pilot.

Table 3: Requirements of the Second SC2 Pilot

R1. In case of failures during data processing, users should be able to recover data produced earlier in the workflow, along with their lineage and associated metadata.
    Comment: Processing modules should periodically store intermediate results and their lineage metadata. When starting up, processing modules should check at the metadata registry if intermediate results are available.

R2. Extracting images and their captions from scientific publications.
    Comment: To be developed for the pilot, taking into account R1.

R3. Extracting thematic annotations from text in scientific publications.
    Comment: To be developed for the pilot, taking into account R1.

R4. Extracting researcher affiliations from the scientific publications.
    Comment: To be developed for the pilot, taking into account R1.

R5. Variety identification.
    Comment: To be developed for the pilot, taking into account R1.

R6. Phenologic modeling.
    Comment: To be developed for the pilot, taking into account R1.

R7. Expose data and metadata in JSON through a Web API.
    Comment: The data ingestion module should write JSON documents in HDFS. 4store should be accessed via a SPARQL endpoint that responds with results in JSON.

Figure 2: Architecture of the Second SC2 Pilot

3.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures:
- HDFS for storing publication full-text and ingested datasets.
- A graph database for storing publication metadata (terms and named entities), affiliation metadata (connections between researchers), weather metadata and VITIS metadata.

Processing infrastructures:
- Metadata extraction: Spark or UnifiedViews3 is used to extract RDF metadata from publication full-text. These tools will react on Kafka messages. Spark and UnifiedViews will be evaluated for this task.
- PoolParty: a SKOS Thesaurus4 will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g. bio terms, locations and other named entities). For this step the SWC PoolParty Semantic Suite5 will be used. Additional enrichment of the dataset will be explored, e.g. via linking to DBpedia or other LOD sources.
- AKSTEM: the process of discovering relations and associations between organizations and people in the field of viticulture research.
- Phenologic Modeling: an algorithm already developed in AK VITIS will be adapted to work in the context of an Apache Spark application.
- Variety Identification: already developed in AK VITIS, it will be adapted to work in the context of an Apache Spark application.
- Extraction of images and figures and their captions from publication PDFs.
- Data analysis, which writes analysis results back into the infrastructure to be retrieved for visualization. Data analysis should accompany each write-back with appropriate metadata that specifies the processing lineage of the derived dataset. Intermediate results should also be written out (and described as such in the metadata) in order to allow resuming processing after a failure.

Other modules:
- Flume for publication ingestion: for every source that will be ingested into the system, there will be a Flume agent responsible for data ingestion and basic modification/unification.
- Kafka: as soon as a new record is available, a Kafka message will be produced. One Kafka consumer stores the raw data into HDFS (a minimal sketch of such a consumer is given after this list).

3 Cf. http://www.unifiedviews.eu/
4 Cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System
5 Cf. http://www.poolparty.biz/
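The sketch below illustrates the "Kafka consumer stores raw data into HDFS" step under stated assumptions: the topic name, broker address and WebHDFS URL are placeholders, and the kafka-python and hdfs client libraries are assumed to be available.

    # Minimal sketch of a Kafka consumer archiving raw records into HDFS.
    # Topic, broker and NameNode addresses are illustrative assumptions.
    import json
    from kafka import KafkaConsumer        # kafka-python
    from hdfs import InsecureClient        # WebHDFS client

    consumer = KafkaConsumer(
        "sc2-publications",                        # assumed topic name
        bootstrap_servers="kafka:9092",
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
        auto_offset_reset="earliest",
        group_id="hdfs-archiver",
    )
    hdfs = InsecureClient("http://namenode:50070", user="bde")

    for message in consumer:
        record = message.value
        # One raw JSON document per Kafka record, partitioned by topic/offset
        path = "/data/raw/{}/{}-{}.json".format(
            message.topic, message.partition, message.offset)
        with hdfs.write(path, encoding="utf-8", overwrite=True) as writer:
            json.dump(record, writer)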

3.4 Deployment

Table 4 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.

Table 4: Components needed to deploy the Second SC2 Pilot

Module | Task | Responsible
Spark over HDFS, Flume, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, SWC
GraphDB and/or Neo4j dockerization | To be investigated whether the Docker images provided by the official systems6 are suitable for the pilot; if not, they will be altered for the pilot, or an already dockerized triple store such as Virtuoso or 4store will be used | SWC
Flume agents for publication ingestion and processing | To be developed for the pilot | SWC
Flume agents for data ingestion | To be extended for the pilot in order to support the introduced datasets (AccuWeather data, user-generated data) | SWC, AK
Data storage schema | To be developed for the pilot | SWC, AK
Phenologic modelling | To be adapted from AK VITIS for the pilot | AK
Spark AKSTEM | To be adapted from AK STEM for the pilot | AK
Variety Identification | To be adapted from AK VITIS for the pilot | AK

6 https://neo4j.com/developer/docker/

4 Second SC3 Pilot Deployment

4.1 Overview

The pilot is carried out by CRES in the frame of SC3: Secure, Clean and Efficient Energy.

The second pilot cycle extends the first pilot by adding additional online and offline data analysis on raw data from Acoustic Emissions (AE) sensors and on aggregated data, such as parametrics from continuous monitoring systems (CMS). The pilot demonstrates the following workflow: a developer in the field of wind energy enhances condition monitoring for each unit in a wind farm by pooling together data from multiple units of the same farm (to consider the cluster operation in total) and third-party data (to perform correlated assessment). The custom analysis modules created by the developer use both raw data that is transferred offline to the processing cluster and condensed data streamed online, in the same time order as the events occur.

The following datasets are involved:

- Raw sensor and SCADA data from a given wind farm.
- Online stream data comprised of parametrics and statistics extracted from the raw SCADA data.
- Raw sensor data from the Acoustic Emissions module of a given wind farm.

All data is in custom binary or ASCII formats. ASCII files contain a metadata header and, in tabulated form, the signal data (signals in columns, time sequence in rows). All data is annotated by location, time and system id. A minimal parsing sketch for this layout is given below.
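The sketch below shows one plausible way to read such an ASCII slice: a short metadata header followed by a numeric table. The header length, the key/value separator and the file name are assumptions for illustration; the actual CRES formats may differ.

    # Minimal sketch of reading one ASCII slice: metadata header plus
    # tabulated signal data (signals in columns, time steps in rows).
    import numpy as np

    def read_ascii_slice(path, header_lines=4):
        metadata = {}
        with open(path) as f:
            for _ in range(header_lines):
                key, _, value = f.readline().partition(":")
                metadata[key.strip()] = value.strip()   # e.g. location, time, system id
            # Remaining lines: one row per time step, one column per signal
            signals = np.loadtxt(f)
        return metadata, signals

    if __name__ == "__main__":
        meta, data = read_ascii_slice("unit07_20161201.txt")   # placeholder file
        print(meta, data.shape)   # (n_timesteps, n_signals)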

The following processing is carried out:

- Near-real-time execution of parametrized models to return operational statistics and warnings, including correlation analysis of data across units.
- Weekly execution of operational statistics.
- Weekly execution of model parametrization.
- Weekly specific acoustic emissions DSP.

The following outputs are made available for visualization or further processing:

- Operational statistics, near-real-time and weekly.
- Model parameters.

4.2 Requirements

Table 5 lists the ingestion, storage, processing and output requirements set by this pilot. Since the second cycle of the pilot extends the first pilot, some requirements are identical and are therefore omitted from Table 5.

Table 5: Requirements of the Second SC3 Pilot

R1. The online data will be sent (via OPC) from the intermediate (local) processing level to BDI.
    Comment: A data connector must be developed that provides for receiving OPC streams from an OPC-compatible server.

R2. The application should be able to recover from short outages by collecting the data transmitted during the outage from the data sources.
    Comment: An OPC data connector must be developed that can retrieve the missing data collected at the intermediate level from the distributed data historian systems.

R3. Near-real-time execution of parametrized models to return operational statistics, including correlation analysis of data across units.
    Comment: The analysis software should write its results back in a specified format and data model that is appropriate input for further analysis.

R4. The GUI supports database querying and data visualization for the analytics results.
    Comment: The GUI will be able to access files in the specified format and data model.

Figure 3: Architecture of the Second SC3 Pilot

4.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures:
- HDFS, which stores binary blobs, each holding a temporal slice of the complete data. The slicing parameters are fixed and can be applied at data ingestion time.
- A Postgres relational database to store the warnings, operational statistics and the output of the analysis. The schema will be defined at a later stage.
- A Kafka broker that will distribute the continuous stream of CMS data to model execution.

Processing infrastructures:
- A processor that operates upon temporal slices of data.
- A Spark module that orchestrates the application of the processor on slices.
- A Spark streaming module that operates on the online data (a minimal sketch is given after this list).

Other modules:
- A data connector that offers an ingestion endpoint and/or can retrieve from remote data sources using the FTP protocol.
- A data connector that offers an ingestion endpoint that can retrieve an online stream using the OPC protocol and publish it to a Kafka topic.
- Data visualization that can visualize the data files stored in HDFS.
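As a sketch of the Spark streaming module consuming the online CMS stream from Kafka, the code below computes windowed statistics per unit and parameter. It assumes a Spark version with Structured Streaming and the Kafka source package available; topic, broker and field names are illustrative assumptions.

    # Minimal sketch of near-real-time statistics over the online CMS stream.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col, window, avg, max as smax
    from pyspark.sql.types import (StructType, StructField, StringType,
                                   DoubleType, TimestampType)

    spark = SparkSession.builder.appName("sc3-cms-stats").getOrCreate()

    schema = StructType([
        StructField("unit_id", StringType()),
        StructField("timestamp", TimestampType()),
        StructField("parameter", StringType()),
        StructField("value", DoubleType()),
    ])

    stream = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "kafka:9092")
              .option("subscribe", "cms-parametrics")          # assumed topic
              .load()
              .select(from_json(col("value").cast("string"), schema).alias("m"))
              .select("m.*"))

    stats = (stream
             .withWatermark("timestamp", "10 minutes")
             .groupBy(window(col("timestamp"), "10 minutes"),
                      col("unit_id"), col("parameter"))
             .agg(avg("value").alias("mean"), smax("value").alias("peak")))

    query = stats.writeStream.outputMode("update").format("console").start()
    query.awaitTermination()

In the pilot, the sink would be the Postgres database rather than the console, and the statistics would be the parametrized models described above.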

4.4 Deployment

Table 6 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.

Table 6: Components needed to deploy the Second SC3 Pilot

Module | Task | Responsible
Spark, HDFS, Postgres, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, SWC
Acoustic Emissions DSP | To be developed for the pilot | CRES
OPC data connector | To be developed for the pilot | CRES
Data visualization | To be extended for the pilot | CRES

5 Second SC4 Pilot Deployment

5.1 Use cases

The pilot is carried out by FhG and CERTH in the frame of SC4: Smart, Green and Integrated Transport.

The pilot demonstrates how to implement the workflow for ingesting, processing and storing stream and historical traffic data in a distributed environment. The pilot demonstrates the following workflows:

- The map matching of the Floating Car Data (FCD) stream that is generated by the taxi fleet. The FCD data, which represents the position of cabs using latitude and longitude coordinates, must be map-matched to the roads on which the cabs are driving, in order to infer the traffic conditions of the roads. The map matching is done through an algorithm using a geographical database and topological rules.
- The monitoring of the current traffic conditions, which consumes the map-matched FCD data and infers the traffic conditions of the roads.
- The forecasting of future traffic conditions, based on a model that is trained from historical and real-time map-matched FCD data.

The second pilot is based upon the processing modules developed in the first pilot (cf. D5.2, Section 5), namely the processing modules developed by CERTH to analyze traffic data and classify traffic conditions. The second pilot will also develop the newly added workflow of traffic forecasting and model training, which did not exist during the first pilot cycle.

The data sources available for the pilot are:

- A near-real-time Floating Car Data (FCD) stream generated by a fleet of 1200 taxis, containing information about the position, speed and direction of the cabs.
- A historical database of recorded FCD data.
- A geographical database with information about the road network in Thessaloniki.

The results of traffic monitoring and traffic forecasting are saved into a database for querying, statistics and visualizations.

5.2 Requirements

Table 7 lists the ingestion, storage, processing and output requirements set by this pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 7 lists only the new requirements.

Table 7: Requirements of the Second SC4 Pilot

R1. The pilot will enable the evaluation of the present and future traffic conditions (e.g. congestion) within temporal windows.
    Comment: The map-matched FCD data are used to determine the current traffic condition and to make predictions within different time windows.

R2. The traffic predictions will be saved in a database.
    Comment: Traffic conditions and predictions will be used for queries, statistics, evaluation of the quality of predictions, and visualizations.

R3. The pilot can be started in two configurations: single node (for development and testing) and cluster (production).
    Comment: It must be possible to run all the pilot components in one single node for development and testing purposes. The cluster configuration must provide a cluster of any of the components: messaging system (Kafka), processing modules (Flink, Spark, TensorFlow), storage (Postgres).

Figure 4: Architecture of the Second SC4 Pilot

5.3 Architecture

The architecture of the pilot has been designed taking into consideration the data sources (mostly streams), the processing steps needed, and the information that needs to be computed. The pilot will ingest data from a near-real-time FCD data stream from cabs and from historical FCD data. The FCD data needs to be preprocessed for map matching before being used for classification/prediction.

Apache Kafka will be used to distribute the computations, as it provides a scalable, fault-tolerant messaging system. The processing of the data streams will be performed within temporal windows. Apache Flink will be used for the map matching algorithm, in the same manner as in the first cycle of the pilot. Apache Spark or TensorFlow will be considered as a platform to implement the traffic forecasting algorithm.

The algorithms used for the map matching and classification will be provided using R, as it provides good support for machine learning algorithms and because it is commonly used and well known by researchers at CERTH. In order to use the R packages in a Flink application developed in Java, the pilot will connect to an R server (via Rserve). Recurrent Neural Networks will be used for the traffic forecasting module.

The traffic conditions and prediction computations will be stored in a scalable, fault-tolerant database, such as Elasticsearch. The storage system must support spatial and temporal indexing. A minimal sketch of the FCD ingestion side of this architecture is given below.
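The sketch below illustrates the Kafka producer foreseen in Table 8 for the FCD stream: poll a web service and forward each cab record to a Kafka topic. The service URL, topic name and polling interval are assumptions for illustration, not the pilot's actual configuration.

    # Minimal sketch of an FCD Kafka producer polling a web service.
    import json
    import time
    import requests
    from kafka import KafkaProducer     # kafka-python

    FCD_SERVICE = "https://example.org/fcd/latest"   # placeholder web service
    producer = KafkaProducer(
        bootstrap_servers="kafka:9092",
        value_serializer=lambda record: json.dumps(record).encode("utf-8"),
    )

    while True:
        response = requests.get(FCD_SERVICE, timeout=10)
        response.raise_for_status()
        for record in response.json():               # one record per cab position
            producer.send("fcd-raw", record)         # lat/lon/speed/heading payload
        producer.flush()
        time.sleep(60)                               # assumed polling interval

The map matching, classification and forecasting consumers then read from the same topic within their temporal windows.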

5.4 Deployment

Table 8 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.

Table 8: Components needed to deploy the Second SC4 Pilot

Module | Task | Responsible
PostGIS, Elasticsearch, Kafka, Flink, Spark, TensorFlow | BDI dockers made available by WP4 | NCSR-D, SWC, TF, FhG
A Kafka producer for the FCD data stream (source: URL) and historical data (source: file system) | Develop a Kafka producer to collect the FCD data as a stream from web services and from the file system for the historical data sets, and send them to a Kafka topic | FhG
Kafka brokers | Install Kafka to provide a message broker and the topics | SWC
A Spark application for traffic forecasting and model training | Develop a Spark application that consumes map-matched FCD data from a Kafka topic; the application will train a prediction model and write the traffic predictions to Elasticsearch | FhG
A Kafka consumer for storing analysis results | Develop a Kafka consumer that stores the results of the traffic classification and prediction modules | FhG

6 Second SC5 Pilot Deployment

6.1 Use cases

The pilot is carried out by NCSR-D in the frame of SC5: Climate Action, Environment, Resource Efficiency and Raw Materials.

The pilot demonstrates the following workflow. A (potentially hazardous) substance is released in the atmosphere, which results in increased readings in one or more monitoring stations. The user accesses a user interface provided by the pilot to define the locations of the monitoring stations as well as a timeseries of the measured values (e.g. gamma dose rate). The platform initiates:

- a weather matching algorithm, that is, a search for similarity of the current weather with the pre-computed weather patterns, as well as
- a dispersion matching algorithm, that is, a search for similarity of the current substance dispersion with the precomputed dispersion patterns.

The weather patterns have been extracted in a pre-processing step by clustering weather conditions recorded in the past, while the substance dispersion patterns have been precomputed by simulating different scenarios of substance release and weather conditions. The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request.

The following datasets are involved:

- NetCDF files from the European Centre for Medium-range Weather Forecasting (ECMWF7).
- GRIB files from the National Oceanic and Atmospheric Administration (NOAA8).

The following processing will be carried out:

- The weather clustering algorithm that creates clusters of similar weather conditions, implemented using the BDI platform (see Section 6.3).
- The WRF downscaling, which takes as input a low-resolution weather field and creates a high-resolution one.
- The DIPCOT (DIsPersion Over COmplex Terrain) atmospheric dispersion model, which computes dispersion patterns given predominant weather conditions.

The following outputs are made available for visualization or further processing:

- The dispersions produced by DIPCOT.
- The weather clusters produced by the weather clustering algorithm.

7 http://apps.ecmwf.int/datasets/
8 https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/global-forcast-system-gfs

6.2 Requirements

Table 9 lists the ingestion, storage, processing and output requirements set by this pilot.

Table 9: Requirements of the Second SC5 Pilot

R1. Provide a means of downloading current/evaluation weather from ECMWF or alternative services.
    Comment: A data connector/interface needs to be developed.

R2. ECMWF and NOAA datasets are compatible with the WRF and DIPCOT naming conventions.
    Comment: A preprocessing WPS normalization step will perform the necessary transformations and variable renamings needed to ensure compatibility.

R3. Retrieve NetCDF files from HDFS as input to the weather clustering algorithm.

R4. Dispersion matching will filter on dispersion values.
    Comment: A relational database will provide indexes on dispersion values for efficient dispersion search.

R5. Dispersion visualization.
    Comment: Weather and dispersion matching must produce output compatible with Sextant's input, or Sextant must be modified to support the new input.

Figure 5: Architecture of the Second SC5 Pilot

6.3 Architecture

To satisfy the requirements described above, the following components will be deployed.

Storage infrastructure:
- HDFS for storing NetCDF and GRIB files.
- Postgres for storing dispersions.

Processing components:
- Scikit-learn or TensorFlow to host the weather clustering algorithm (a minimal sketch is given after this list).

Other modules:
- ECMWF and NOAA data connectors.
- WPS normalization procedure.
- WRF downscaling component.
- DIPCOT atmospheric dispersion model.
- Weather and dispersion matching.
- Sextant for visualizing the dispersion layer.
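The sketch below illustrates the weather clustering pre-processing step with scikit-learn: flatten a set of historical NetCDF weather fields into feature vectors and cluster them with k-means. The file pattern, variable names, number of clusters and the assumption that all snapshots share the same grid are illustrative, not the pilot's actual configuration.

    # Minimal sketch of clustering historical weather snapshots.
    import glob
    import numpy as np
    from netCDF4 import Dataset
    from sklearn.cluster import KMeans

    def features(path, variables=("u10", "v10", "t2m")):
        with Dataset(path) as nc:
            # One flat feature vector per file (one historical weather snapshot)
            return np.concatenate(
                [np.asarray(nc.variables[v][:]).ravel() for v in variables])

    paths = sorted(glob.glob("/data/ecmwf/*.nc"))      # placeholder location
    X = np.vstack([features(p) for p in paths])

    kmeans = KMeans(n_clusters=8, random_state=0).fit(X)
    for path, label in zip(paths, kmeans.labels_):
        print(label, path)        # cluster id assigned to each snapshot
    # kmeans.cluster_centers_ would then serve as the pre-computed weather patterns
    # against which the weather matching algorithm compares current conditions.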

6.4 Deployment

Table 10 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.

Table 10: Components needed to deploy the Second SC5 Pilot

Module | Task | Responsible
HDFS, Sextant, Postgres | BDI dockers made available by WP4 | TF, UoA, NCSR-D
Scikit-learn, TensorFlow | To be developed in the pilot | NCSR-D
DIPCOT | To be packaged in the pilot | NCSR-D
Weather clustering algorithm | To be developed in the pilot | NCSR-D
Weather matching | To be developed in the pilot | NCSR-D
Dispersion matching | To be developed in the pilot | NCSR-D
ECMWF and NOAA data connector | To be developed in the pilot | NCSR-D
Data visualization UI | To be developed in the pilot | NCSR-D

7 Second SC6 Pilot Deployment

7.1 Use cases

The pilot is carried out by NCSR-D and SWC in the frame of SC6: Europe in a changing world - inclusive, innovative and reflective societies.

The pilot demonstrates the following workflow. Municipality economic data (i.e. budget and budget execution data) are ingested on a regular basis (daily, weekly, and so on) from a series of locations, in a variety of structures and formats, and are homogenized so that they can be compared, analyzed and visualized in a comprehensible way. The data is exposed to users via a dashboard that exposes search/discovery, aggregation, analysis, correlation and visualization functionalities over structured data. The results of the data analysis will be stored in the infrastructure to avoid carrying out the same processing multiple times.

The second cycle of the pilot will extend the first pilot by incorporating different formats, by developing a modular parsing library.

The following datasets are involved:

- Budget execution data of the Municipality of Athens.
- Budget execution data of the Municipality of Thessaloniki.
- Budget execution data of the Municipality of Barcelona.

The current datasets involved are exposed either as an API or as CSV/XML files. Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies. Statistical data will be described in the RDF Data Cube12 vocabulary; a minimal sketch of this representation is given after the lists below.

The following processing is carried out:

- Data ingestion and homogenization.
- Aggregation, analysis and correlation over the data.

The following outputs are made available for visualization or further processing:

- Structured information extracted from budget datasets, exposed as a SPARQL endpoint.
- Metadata for dataset searching and discovery.
- Aggregation and analysis results.

9 Cf. https://joinup.ec.europa.eu/asset/dcat_application_profile/description
10 Cf. http://www.omg.org/spec/EDMC-FIBO/FND/1.0/Beta1/index.htm
11 Cf. http://www.omg.org/hot-topics/finance.htm
12 Cf. https://www.w3.org/TR/2014/REC-vocab-data-cube-20140116/
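As a small illustration of the RDF Data Cube representation mentioned above, the sketch below homogenizes a single budget execution line into a qb:Observation using rdflib. The namespace, dataset URI and property names are illustrative assumptions; the pilot's actual schema is defined by the data storage schema component.

    # Minimal sketch of one budget line as an RDF Data Cube observation.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, XSD

    QB = Namespace("http://purl.org/linked-data/cube#")
    EX = Namespace("http://example.org/budget/")     # placeholder namespace

    g = Graph()
    obs = EX["obs/athens-2016-001"]                  # placeholder observation URI
    g.add((obs, RDF.type, QB.Observation))
    g.add((obs, QB.dataSet, EX["dataset/athens-budget-execution"]))
    g.add((obs, EX.municipality, Literal("Athens")))
    g.add((obs, EX.year, Literal("2016", datatype=XSD.gYear)))
    g.add((obs, EX.budgetLine, Literal("00-6011 Salaries")))       # illustrative line
    g.add((obs, EX.amount, Literal("1234567.89", datatype=XSD.decimal)))

    print(g.serialize(format="turtle"))

Observations of this shape are what the parsers produced for R2 would write into 4store, so that the pre-defined dashboard queries can aggregate across municipalities.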

7.2 Requirements

Table 11 lists the ingestion, storage, processing and output requirements set by this pilot.

Table 11: Requirements of the Second SC6 Pilot

R1. In case of failures during data processing, users should be able to recover data produced earlier in the workflow, along with their lineage and associated metadata.
    Comment: Processing modules should periodically store intermediate results. When starting up, processing modules should check at the metadata registry if intermediate results are available.

R2. Transform budget data into a homogenized format using various parsers.
    Comment: Parsers will be developed for the pilot, taking into account R1.

R3. Expose data and metadata through a SPARQL endpoint.
    Comment: The triple store should be accessed via a SPARQL endpoint.

R4. Intuitive, easy-to-use interface for searching and selecting relevant data sources. The use of the user interface should be documented, so that users can ease into using it with as little effort as possible.
    Comment: The GraphSearch UI will be used to create visualizations from SPARQL queries.

Figure 6: Architecture of the Second SC6 Pilot

7.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures:
- HDFS for storing ingested datasets.
- 4store for storing homogenized statistical data and dataset metadata.

Processing infrastructures:
- Metadata extraction: Spark is used to extract RDF data and metadata from budget data. These tools will react on Kafka messages.
- PoolParty: a SKOS Thesaurus13 will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g. terms, locations and other named entities). For this step the SWC PoolParty Semantic Suite14 will be used as an external service. PoolParty is accessible from the BDE components via an HTTP API; the connection between Spark and PoolParty has been implemented in the first pilot cycle. Additional enrichment of the dataset will be explored, e.g. via linking to DBpedia or other LOD sources.
- Data analysis, which will be performed on demand by pre-defined queries in the dashboard.

Other modules:
- Flume for dataset ingestion: for every source that will be ingested into the system, there will be a Flume agent responsible for data ingestion and basic modification/unification.
- Kafka: as soon as a new record is available, a Kafka message will be produced. One Kafka consumer stores the raw data into HDFS.
- A set of pre-defined SPARQL queries that carry out analytical aggregations, important comparisons and/or other analysis of the data (a minimal sketch of such a query is given after this list).
- A GUI that provides functionality for (a) metadata searching, to discover datasets, data and publications, and (b) linked data browsing (i.e. dereferencing entity descriptions in RDF), in the form of a visual dashboard realised in d3.js15.
- GraphSearch as the user interface.

13 Please cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System
14 Please cf. http://www.poolparty.biz/
15 Cf. https://d3js.org/
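The sketch below illustrates what one of the pre-defined analytical SPARQL queries could look like when issued over HTTP against the triple store: total executed amount per municipality and year. The endpoint URL and property names are illustrative assumptions that mirror the Data Cube sketch in Section 7.1, not the pilot's actual vocabulary.

    # Minimal sketch of a pre-defined analytical SPARQL query for the dashboard.
    import requests

    ENDPOINT = "http://4store:8080/sparql/"          # assumed SPARQL endpoint

    QUERY = """
    PREFIX qb: <http://purl.org/linked-data/cube#>
    PREFIX ex: <http://example.org/budget/>
    SELECT ?municipality ?year (SUM(?amount) AS ?total) WHERE {
      ?obs a qb:Observation ;
           ex:municipality ?municipality ;
           ex:year ?year ;
           ex:amount ?amount .
    } GROUP BY ?municipality ?year ORDER BY ?municipality ?year
    """

    rows = requests.post(
        ENDPOINT,
        data={"query": QUERY},
        headers={"Accept": "application/sparql-results+json"},
        timeout=30,
    ).json()["results"]["bindings"]

    for row in rows:
        print(row["municipality"]["value"], row["year"]["value"], row["total"]["value"])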

7.4 Deployment

Table 12 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.

Table 12: Components needed to deploy the Second SC6 Pilot

Module | Task | Responsible
Spark over HDFS, 4store, Flume, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, SWC
Data storage schema | To be extended for the pilot | SWC
Metadata extraction | Parsers for different data sources will be developed for the pilot | SWC
GraphSearch GUI | To be configured for the pilot | SWC

8 Second SC7 Pilot Deployment

8.1 Use cases

The pilot is carried out by SatCen, UoA and NCSR-D in the frame of SC7: Secure societies - Protecting freedom and security of Europe and its citizens.

The pilot demonstrates the following workflows:

1. Event detection workflow: news sites and social media are monitored and processed in order to extract and localize information about events. Events are categorized and the information in them is extracted; the end-user is notified about the area concerned by the news and can visualize the event information together with the changes detected by the other workflow (if activated).
2. Change detection workflow: the end user selects a relevant Area of Interest. With respect to the selected dates, two satellite images (earliest and latest) of these areas are downloaded from the ESA Sentinels Scientific Data Hub and processed in order to detect changes. The end-user is notified about detected changes and can view the images and event information about this area.

The second cycle of the SC7 pilot will extend the functionality and improve the performance of the first cycle of the pilot (cf. D5.2, Section 8).

Apart from the datasets used in the first cycle of the pilot, this cycle will also use the keyword-based Twitter API to retrieve tweets based on pre-defined keywords. To further support the keyword-based search, the second cycle of the pilot will also include a full-text indexing engine.

The following outputs are made available for visualization or further processing:

- Relevant news related to specific keywords, together with the corresponding Area of Interest.
- Detected changes.

Moreover, the event detection workflow will be extended in order to automatically activate the change detection workflow. These changes are depicted in the updated architecture diagram in Figure 7.

8.2 Requirements

Table 13 lists the ingestion, storage, processing and output requirements set by the second cycle of the pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 13 lists only the new requirements.

Table 13: Requirements of the Second SC7 Pilot

R1. Monitor keyword-based text services (Twitter). Text is retrieved and stored together with provenance and any metadata provided by the service (notably location).
    Comment: The NOMAD data connectors to Twitter and Reuters will be adapted to access the keyword search API of Twitter and store to Cassandra. A minimal storage sketch is given after this table.

R2. Regularly execute event detection using Spark over the most recent text batch.
    Comment: Event detection is part of the ingestion process and adds annotations to the text data; it is not part of the distributed processing.

R3. Improve the speed of the change detection workflow.
    Comment: Optimize the scalability of the operators developed in Apache Spark for the change detection workflow.

R4. Extend the change detection workflow to improve accuracy.
    Comment: Fundamental SNAP operators (e.g. Subset and Terrain Correction) for Sentinel-1 will be adapted to Apache Spark.

R5. Areas of Interest are automatically defined by event detection.
    Comment: The Sentinel data connector is parametrized from the event detection module with a GIS shape.

R6. End-user interface is based on Sextant.
    Comment: Improvement of Sextant functionalities to improve the user experience.

R7. Users must be authenticated and authorized to access the pilot data.
    Comment: Sextant will be extended in order to support authentication and authorization.
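The sketch below illustrates the storage side of R1: writing keyword-matched tweets into Cassandra together with provenance and the metadata returned by the service. The keyspace, table and column names are illustrative assumptions; the actual schema is the one to be altered by the "Cassandra and Strabon stores" task in Table 14.

    # Minimal sketch of storing keyword-matched tweets in Cassandra.
    from datetime import datetime, timezone
    from cassandra.cluster import Cluster

    cluster = Cluster(["cassandra"])                 # assumed contact point
    session = cluster.connect()
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS sc7
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)
    session.execute("""
        CREATE TABLE IF NOT EXISTS sc7.tweets_by_keyword (
            keyword text, tweet_id bigint, created_at timestamp,
            text text, geo text, source text,
            PRIMARY KEY (keyword, created_at, tweet_id)
        ) WITH CLUSTERING ORDER BY (created_at DESC)
    """)

    insert = session.prepare("""
        INSERT INTO sc7.tweets_by_keyword (keyword, tweet_id, created_at, text, geo, source)
        VALUES (?, ?, ?, ?, ?, ?)
    """)

    def store(keyword, tweet):
        # 'tweet' is a dict as returned by the keyword search API (assumed fields)
        session.execute(insert, (
            keyword,
            tweet["id"],
            tweet.get("created_at", datetime.now(timezone.utc)),
            tweet["text"],
            str(tweet.get("geo")) if tweet.get("geo") else None,   # raw location metadata
            "twitter-keyword-search",                              # provenance
        ))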

Figure 7: Architecture of the Second SC7 Pilot

8.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures:
- HDFS for storing satellite images.
- Cassandra for storing news and tweet content and metadata.
- Lucene for storing the GADM dataset, i.e. the administrative areas together with their geo-locations.
- Strabon for storing geo-locations of detected changes and location metadata about news and tweets.

Processing infrastructures:
- Spark will be made available for improving the change detection module and developing the event detection module (a change detection sketch is given after this list).

Data integration:
- Semagrow will federate Strabon and Cassandra to provide the user interface with homogeneous access to both data stores.

Other modules:
- Twitter data connector.
- Reuters RSS feed reader.
- The Sentinel Data Aggregator, which receives as input the set of Areas of Interest and submits a suitable query to the Sentinels Scientific Data Hub.
- Sextant as the user interface.
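The sketch below shows one way the change detection step could be distributed with Spark: pair up the earliest and latest pre-processed tiles of an Area of Interest and flag tiles whose pixel difference exceeds a threshold. The paths, tile format and the simple differencing rule are illustrative assumptions, not the pilot's actual change detection algorithm, which is developed and improved by UoA.

    # Minimal sketch of distributing change detection over tile pairs with Spark.
    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName="sc7-change-detection")

    def detect_change(pair):
        tile_id, (path_before, path_after) = pair
        before = np.load(path_before)                # pre-processed tile as .npy (assumed)
        after = np.load(path_after)
        score = float(np.mean(np.abs(after - before)))
        return tile_id, score, score > 0.2           # assumed change threshold

    tile_pairs = [                                   # normally listed from storage
        ("aoi1-tile-000", ("/data/s1/before/000.npy", "/data/s1/after/000.npy")),
        ("aoi1-tile-001", ("/data/s1/before/001.npy", "/data/s1/after/001.npy")),
    ]

    results = sc.parallelize(tile_pairs).map(detect_change).collect()
    for tile_id, score, changed in results:
        print(tile_id, round(score, 3), "CHANGE" if changed else "no change")

In the pilot, the per-tile function would instead wrap the SNAP-based operators mentioned in R4, and detected changes would be written to Strabon with their geo-locations.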

8.4 Deployment

Table 14 lists the components provided to the pilot as part of BDI16 and components that will be developed within WP6 in the context of executing the pilot.

16 Cf. https://github.com/big-data-europe/README/wiki/Components

Table 14: Components needed to deploy the Second SC7 Pilot

Module | Task | Responsible
Big Data Integrator: HDFS/Hadoop, Cassandra, Spark, Semagrow, Strabon, SOLR | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, UoA, SWC
Cassandra and Strabon stores | The schema needs to be altered to support tweets by keyword | NCSR-D and UoA
Change detection module | Spark code to be developed for extending and improving the change detection algorithm | UoA
Event detection module | Spark code to be developed to scale the event detection algorithm | NCSR-D
Twitter data connector | To be extended to access the keyword search Twitter API | NCSR-D
User interface | To be enhanced for the pilot | UoA

9 Conclusions

This report analysed the pilot requirements and specified the components of the generic Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting round. The relevant work in this task is to ensure that the components are within the scope of what is prepared in WP4, and that they interoperate and can be used in the same application.

All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure and provided to the piloting partners as a basis for their piloting applications, which will be developed in WP6. As a result of this preliminary testing and the interaction between the technical partners and the piloting partners, some of the original pilot descriptions have been refined and fully specified, and their usage of BDI components has been clarified. This ensures that the pilot descriptions are consistent with the first public release of the BDI platform (D4.2) and can be reproduced by interested third parties.

Work in this task (Task 5.2) will proceed as follows:

- During the second pilot deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.
- During the third pilot deployment phase, work in this task will prepare the next version of this document, regarding the BDI instances needed for the third piloting round.

Page 9: Data Technologies · consulting about the platform. This task includes two phases: the design and the deployment phase. The design phase involves the following. Review the pilot descriptions

D54 ndash v 100

Page

9

1 Introduction

11 Purpose and Scope

This report documents the instantiations of the Big Data Integrator Platform (BDI) for serving

the needs of the domains examined within Big Data Europe These platform instances will be

provided to the relevant networking partners to execute the pilots foreseen in WP6

12 Methodology

Task 52 focuses on the application of the generic Instantiation methodology in a specific Use

Case pertaining to domains closely related to Europersquos Social challenges To this end T52

comprises seven (7) distinct sub-tasks each one dedicated to a different domain of application

Participating partners and their role NCSR-D (task leader) deploys the different instantiations

of the Big Data Integrator Platform and supports the partners carrying out each pilot with

consulting about the platform This task includes two phases the design and the deployment

phase The design phase involves the following

Review the pilot descriptions prepared in WP6 and request clarifications where needed

in order to prepare a detailed technical description of the platform that will support the

pilot

Prepare a first draft of the sections for the second cycle pilots where use cases and

workflow from the pilot descriptions are summarized and technical requirements and

an architecture for each pilot-specific platform is drafted

Cooperate with the persons responsible for each pilot to update the pilot description

and the technical description in this deliverable so that they are consistent and

satisfactory This draft also includes a list of components and their availability (a) base

platform components that are prepared in WP4 (b) pilot-specific components that are

already available or (c) pilot-specific components that will be developed for the pilot

Components are also assigned a partner responsible for their implementation

Review the pilot technical descriptions from the perspective of bridging between

technical work and the community requirements to establish that the pilot is relevant

to the communities it is aimed at

During deployment phase work in this task will follow and document development of the

individual components and test their integration into the platform

D54 ndash v 100

Page

10

2 Second SC1 Pilot Deployment

21 Use Cases

The pilot is carried out by OPF and VU in the frame of SC1 Health Demographic Change and

Wellbeing

The pilot demonstrates the workflow of reproducing the functionality of an existing data

integration and processing system (the Open PHACTS Discovery Platform) on BDI The

second pilot extends the first pilot (cf D52 Section 2) with the following

Discussions with stakeholders and other Societal Challenges will identify how the

existing Open PHACTS platform and datasets may potentially be used to answer

queries in other domains In particular applications in Societal Challenge 2 (food

security and sustainable agriculture) where the effects of chemistry (eg pesticides)

on biology are probed in plants could exploit the linked data services currently within

the OPF platform This will require discussing use case specifics with SC2 to

understand their requirements and ensure that the OPF data is applicable Similarly

we will explore whether SC2 data could be linked to the OPF data platform is relevant

for early biology research

No specific new datasets are targeted for integration in the second pilot However if

datasets to be made available through other pilots have clear potential links to Open

PHACTS datasets these will be considered for integration into the platform to offer

researchers the ability to pose more complex queries across a wider range of data

The second pilot will aim to expand on first pilot by refreshing the datasets integrated

into the pilot Homogenising and integrating the new data available for these datasets

and developing ways to update datasets by integrating new data on an ongoing basis

will enable new use cases where researchers require fully current datasets for their

queries

The second pilot will also simplify existing workflows for querying the API for example

with components for common software tools such as KNIME reducing the barrier for

academic institutions and companies to access the platform for knowledge- and data-

driven biomedical research use cases

22 Requirements

Table 1 lists the ingestion storage processing and output requirements set by this pilot

Table 1 Requirements of the Second SC1 Pilot

D54 ndash v 100

Page

11

Requirement Comment

R1 The solution should be

packaged in a way such that it is

possible to combine the Open

PHACTS Docker and the BDE

platform to achieve a custom

integrated solution

Specificities of the services of the Open PHACTS

Discovery Platform should not be hard-wired into

the domain-specific instance but should be read

from a configuration file (such as SWAGGER)

The BDE instance should offer or apply these

external services over data hosted by the BDE

instance

R2 RDF data storage The current Open PHACTS Discovery Platform is

based on distributed Virtuoso a proprietary

solution The BDE platform will provide a

distributed 4store and SANSA to be compared

with the Open PHACTS Discovery Platform

R3 Datasets are aligned and linked

at data ingestion time and the

transformed data is stored

In conjunction with R1 a modular data ingestion

component should dynamically decide which data

transformers to invoke

R4 Data and query security and

privacy requirements

A BDI local deployment holds private data and

serves private queries BDE does not foresee any

specific technical support for query obfuscation

so remote data sources need to be cloned locally

to guarantee query privacy

Table 1 Requirements of the Second SC1 Pilot

D54 ndash v 100

Page

12

Figure 1 Architecture of the Second SC1 Pilot

Figure 1 Architecture of the Second SC1 pilot

23 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

Distributed triple store for the data The second pilot cycle will also test the feasibility of

using SANSA stack1 as an alternative of SPARQL query processing

Processing infrastructures

Scientific Lenses query expansion

Other modules

Data connector including the data transformation modules for the alignment of data at

ingestion time

REST API for querying that builds a SPARQL query by using keywords to fill in pre-

defined query templates The querying services also uses Scientific Lenses to expand

queries

24 Deployment

Table 2 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

1 httpsansa-stacknet

D54 ndash v 100

Page

13

Table 2 Components needed to Deploy Second SC1 Pilot

Module Task Responsible

4store BDI dockers made available by WP4 NCSR-D

SANSA stack BDI dockers made available by WP4 FhGUniBonn

Data connector and

transformation modules

Develop a dynamic transformation

engine that uses SWAGGER

descriptions to select the appropriate

transformer

VU

Query endpoint Develop a dynamic query re-write

engine that uses SWAGGER

descriptions to select the transformer

VU

Scientific Lenses query

expansion module

Needs to be deployed and tested

unless an existing live service will be

used for the BDE pilot

VU

Table 2 Components needed to Deploy Second SC1 Pilot

D54 ndash v 100

Page

14

3 Second SC2 Pilot Deployment

31 Overview

The pilot is carried out by AK FAO and SWC in the frame of SC2 Food Security Sustainable

Agriculture and Forestry Marine Maritime and Inland Water Research and the Bioeconomy

The second pilot cycle builds upon the first pilot cycle (cf D51 Section 3) expanding the

relevant data sources and extending the data processing needed to handle a variety of data

types (apart from bibliographic data) relevant to Viticulture

The pilot demonstrates the following workflows

1 Text mining workflow Automatically annotating scientific publications by (a) extracting

named entities (locations domain terms) and (b) extracting the captions of images

figures and tables The extracted information is provided to viticultural researchers via

a GUI that exposes search functionality

2 Data processing workflow The end users (viticultural researchers) upload scientific

data in a variety of formats and provide the metadata needed in order to correctly

interpret the data The data is ingested and homogenized so that it can be compared

and connected with other relevant data originally in diverse formats The data is

exposed to viticultural researchers via a GUI that exposes searchdiscovery

aggregation analysis correlation and visualization functionalities over structured data

The results of the data analysis will be stored in the infrastructure to avoid carrying out

the same processing multiple times with appropriate provence for future reference

publication and scientific replication

3 Phenologic modeling workflow that is the scheduling of agricultural operations (eg

pruning harvesting etc) by cross-examining the weather data observed in the area of

the vineyard with the appropriate weather conditions needed for the aforementioned

operations

4 Variety identification workflow The end users complete an on-spot questionnaire

regarding the characteristics of a specific grape variety Together with the geolocation

of the questionnaire this information is used to identify a grape variety

The following datasets will be involved

The AGRIS and PubMed datasets that include scientific publications

Weather data available via publicly-available API such as AccuWeather

OpenWeatherMap Weather Underground

D54 ndash v 100

Page

15

User-generated data such as geotagged photos from leaves young shoots and grape

clusters ampelographic data SSR-marker data that will be provided by the VITIS

application

OIV Descriptor List2 for Grape Varieties and Vitis species

Crop Ontology

The following processing is carried out

Named entity extraction

Researcher affiliation extraction and verification

Variety identification

Phenologic modelling

PDF structure processing to associate tables and diagrams with captions

Data ingestion and homogenization

Aggregation analysis correlation over scientific data

The following outputs are made available for visualization or further processing

Structured information topics extracted from scientific publications

Metadata for dataset searching and discovery

Aggregation analysis correlation results

32 Requirements

Table 3 lists the ingestion storage processing and output requirements set by this pilot

Table 3 Requirements of the Second SC2 Pilot

Requirement Comment

R1 In case of failures during data

processing users should be able to

recover data produced earlier in the

workflow along with their lineage and

associated metadata

Processing modules should periodically

store intermediate results and their lineage

metadata When starting up processing

modules should check at the metadata

registry if intermediate results are available

R2 Extracting images and their captions

from scientific publications

To be developed for the pilot taking into

account R1

2 httpwwwoivinten

D54 ndash v 100

Page

16

R3 Extracting thematic annotations from

text in scientific publications

To be developed for the pilot taking into

account R1

R4 Extracting researcher affiliations from

the scientific publications

To be developed for the pilot taking into

account R1

R5 Variety identification To be developed for the pilot taking into

account R1

R6 Phenolic modeling To be developed for the pilot taking into

account R1

R5 Expose data and metadata in JSON

through a Web API

Data ingestion module should write JSON

documents in HDFS 4store should be

accessed via a SPARQL endpoint that

responds with results in JSON

Table 3 Requirements of the Second SC2 Pilot

D54 ndash v 100

Page

17

Figure 2 Architecture of the Second SC2 Pilot

Figure 2 Architecture of the Second SC2 Pilot

33 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing publication full-text and ingested datasets

A graph database for storing publication metadata (terms and named entities)

affiliation metadata (connections between researchers) weather metadata and VITIS

metadata

Processing infrastructures

Metadata extraction Spark or UnifiedViews3 are used to extract RDF metadata from

publication full-text These tools will react on Kafka messages Spark and UnifiedViews

will be evaluated for this task

3 Cf httpwwwunifiedviewseu

D54 ndash v 100

Page

18

PoolParty A SKOS Thesaurus4 will be used to consolidatetranslate (linkmap) the

terms in the ingested documents (eg bio terms locations and other named entities)

For this step the SWC PoolParty Semantic Suite5 will be used Additional enrichment

of the dataset will be explored eg via linking to DBpedia or other LOD sources

AKSTEM the process of discovering relations and associations between organizations

and people in the field of viticulture research

Phenolic Modeling algorithm already developed in AK VITIS will be adapted to work in

the context of an Apache Spark application

Variety Identification already developed in AK VITIS will be adapted to work in the

context of an Apache Spark application

Extraction of images and figures and their captions from publication PDFs

Data analysis, which writes analysis results back into the infrastructure to be retrieved for visualization. Data analysis should accompany each write-back with appropriate metadata that specify the processing lineage of the derived dataset. Intermediate results should also be written out (and described as such in the metadata) in order to allow resuming processing after a failure.

Other modules

Flume for publication ingestion: for every source that will be ingested into the system there will be a Flume agent responsible for data ingestion and basic modification/unification

Kafka: as soon as a new record is available, a Kafka message will be produced. One Kafka consumer stores the raw data into HDFS (a minimal sketch of this consumer is given below).
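As an illustration of this ingestion path, the sketch below consumes publication records from a Kafka topic and writes the raw JSON documents to HDFS, in line with R7. It assumes the kafka-python and hdfscli libraries; the topic name, broker address and path layout are illustrative only.

```python
# Minimal sketch: consume ingested publication records from Kafka and persist
# the raw JSON documents to HDFS. Topic, broker and path are assumptions.
import json

from kafka import KafkaConsumer    # kafka-python
from hdfs import InsecureClient    # hdfscli WebHDFS client

consumer = KafkaConsumer(
    "publications-ingested",                     # assumed topic name
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
hdfs = InsecureClient("http://namenode:50070", user="bde")

for record in consumer:
    doc = record.value
    path = "/sc2/raw/{}.json".format(doc.get("id", record.offset))
    with hdfs.write(path, encoding="utf-8", overwrite=True) as writer:
        json.dump(doc, writer)
```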

3.4 Deployment

Table 4 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.

Table 4: Components needed to deploy the Second SC2 Pilot

Module | Task | Responsible
Spark over HDFS, Flume, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, SWC
GraphDB and/or Neo4j dockerization | To be investigated whether the Docker images provided by the official systems [6] are suitable for the pilot. If not, they will be altered for the pilot, or an already dockerized triple store such as Virtuoso or 4store will be used. | SWC
Flume agents for publication ingestion and processing | To be developed for the pilot | SWC
Flume agents for data ingestion | To be extended for the pilot in order to support the introduced datasets (AccuWeather data, user-generated data) | SWC, AK
Data storage schema | To be developed for the pilot | SWC, AK
Phenologic modelling | To be adapted from AK VITIS for the pilot | AK
Spark AKSTEM | To be adapted from AK STEM for the pilot | AK
Variety Identification | To be adapted from AK VITIS for the pilot | AK

4. Cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System
5. Cf. http://www.poolparty.biz
6. https://neo4j.com/developer/docker


4 Second SC3 Pilot Deployment

4.1 Overview

The pilot is carried out by CRES in the frame of SC3: Secure, Clean and Efficient Energy.

The second pilot cycle extends the first pilot by adding additional online and offline data analysis on raw data regarding Acoustic Emissions (AE) sensors and on aggregated data such as parametrics from continuous monitoring systems (CMS). The pilot demonstrates the following workflow: a developer in the field of wind energy enhances condition monitoring for each unit in a wind farm by pooling together data from multiple units of the same farm (to consider the cluster operation in total) and third-party data (to perform correlated assessment). The custom analysis modules created by the developer use both raw data that are transferred offline to the processing cluster and condensed data streamed online in the same time order in which the events occur.

The following datasets are involved:

Raw sensor and SCADA data from a given wind farm

Online stream data comprising parametrics and statistics extracted from the raw SCADA data

Raw sensor data from the Acoustic Emissions module of a given wind farm

All data is in custom binary or ASCII formats. ASCII files contain a metadata header and, in tabulated form, the signal data (signals in columns, time sequence in rows). All data is annotated by location, time, and system id.
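To make the ASCII layout concrete, the following sketch parses such a file into a metadata dictionary and a tabular signal block. The header layout, number of header lines and file name are assumptions for illustration; the actual CRES formats are custom.

```python
# Minimal sketch: parse an ASCII measurement file made of a metadata header
# followed by tabulated signal data (signals in columns, time steps in rows).
# The header layout, number of header lines and file name are assumptions.
import pandas as pd

def parse_ascii_slice(path, header_lines=5):
    metadata = {}
    with open(path) as f:
        for _ in range(header_lines):
            key, _, value = f.readline().partition(":")
            metadata[key.strip()] = value.strip()    # e.g. location, time, system id
        # Remaining lines: one row per time step, one column per signal.
        signals = pd.read_csv(f, delim_whitespace=True)
    return metadata, signals

# Hypothetical usage:
# meta, signals = parse_ascii_slice("wt03_2016-12-01.txt")
# print(meta.get("system id"), signals.shape)
```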

The following processing is carried out:

Near-real-time execution of parametrized models to return operational statistics and warnings, including correlation analysis of data across units

Weekly execution of operational statistics

Weekly execution of model parametrization

Weekly execution of specific acoustic emissions DSP

The following outputs are made available for visualization or further processing:

Operational statistics (near-real-time and weekly)

Model parameters


4.2 Requirements

Table 5 lists the ingestion, storage, processing, and output requirements set by this pilot. Since the second cycle of the pilot extends the first pilot, some requirements are identical and therefore omitted from Table 5.

Table 5: Requirements of the Second SC3 Pilot

Requirement | Comment
R1. The online data will be sent (via OPC) from the intermediate (local) processing level to BDI. | A data connector must be developed that provides for receiving OPC streams from an OPC-compatible server (a connector sketch is given after this table).
R2. The application should be able to recover from short outages by collecting the data transmitted during the outage from the data sources. | An OPC data connector must be developed that can retrieve the missing data collected at the intermediate level from the distributed data historian systems.
R3. Near-real-time execution of parametrized models to return operational statistics, including correlation analysis of data across units. | The analysis software should write its results back in a specified format and data model that is appropriate input for further analysis.
R4. The GUI supports database querying and data visualization for the analytics results. | The GUI will be able to access files in the specified format and data model.
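A hedged sketch of the connector behind R1 is shown below. It assumes an OPC UA server reachable with the python-opcua client and kafka-python for publishing; the endpoint URL, node identifiers, topic name and polling interval are illustrative, and recovery of missed data from the historian systems (R2) is not shown.

```python
# Minimal sketch: poll values from an OPC UA server and publish them to a
# Kafka topic. Endpoint, node ids, topic and polling interval are assumptions.
import json
import time

from opcua import Client           # python-opcua
from kafka import KafkaProducer    # kafka-python

NODE_IDS = ["ns=2;s=WT01.Power", "ns=2;s=WT01.RotorSpeed"]   # assumed node ids

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)
client = Client("opc.tcp://cms-server:4840")                 # assumed endpoint
client.connect()
try:
    nodes = [client.get_node(nid) for nid in NODE_IDS]
    while True:
        sample = {
            "timestamp": time.time(),
            "values": {nid: node.get_value() for nid, node in zip(NODE_IDS, nodes)},
        }
        producer.send("cms-online", sample)                  # assumed topic
        time.sleep(1.0)
finally:
    client.disconnect()
    producer.flush()
```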

Figure 3: Architecture of the Second SC3 Pilot

4.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures

HDFS, which stores binary blobs, each holding a temporal slice of the complete data. The slicing parameters are fixed and can be applied at data ingestion time.

A Postgres relational database to store the warnings, operational statistics, and the output of the analysis. The schema will be defined at a later stage.

A Kafka broker that will distribute the continuous stream of CMS data to model execution

Processing infrastructures


A processor that operates upon temporal slices of data

A Spark module that orchestrates the application of the processor on slices (see the sketch after this list)

A Spark Streaming module that operates on the online data

Other modules

A data connector that offers an ingestion endpoint and/or can retrieve data from remote data sources using the FTP protocol

A data connector that offers an ingestion endpoint and can retrieve an online stream using the OPC protocol and publish it to a Kafka topic

Data visualization that can visualize the data files stored in HDFS
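The batch side of this design could look like the following PySpark sketch, which applies a per-slice processor to the temporal slices stored in HDFS and writes per-slice statistics to Postgres over JDBC. The paths, the toy processor and the JDBC settings are assumptions, not the pilot's actual implementation.

```python
# Minimal sketch: apply a per-slice processor to the temporal slices stored in
# HDFS and persist per-slice statistics to Postgres. Paths, the toy processor
# and the JDBC settings are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sc3-slice-processing").getOrCreate()

def process_slice(kv):
    """Toy per-slice statistics; stands in for the actual CRES processor."""
    path, content = kv
    values = []
    for token in content.split():
        try:
            values.append(float(token))
        except ValueError:
            pass
    count = len(values)
    mean = sum(values) / count if count else 0.0
    return (path, count, mean)

# Each temporal slice is assumed to be one file under /sc3/slices.
slices = spark.sparkContext.wholeTextFiles("hdfs:///sc3/slices/*")
stats = slices.map(process_slice).toDF(["slice_path", "samples", "mean_value"])

stats.write.jdbc(
    url="jdbc:postgresql://postgres:5432/sc3",               # assumed database
    table="slice_statistics",
    mode="append",
    properties={"user": "bde", "password": "bde", "driver": "org.postgresql.Driver"},
)
```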

4.4 Deployment

Table 6 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.

Table 6: Components needed to deploy the Second SC3 Pilot

Module | Task | Responsible
Spark, HDFS, Postgres, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, SWC
Acoustic Emissions DSP | To be developed for the pilot | CRES
OPC data connector | To be developed for the pilot | CRES
Data visualization | To be extended for the pilot | CRES


5 Second SC4 Pilot Deployment

5.1 Use cases

The pilot is carried out by FhG and CERTH in the frame of SC4: Smart, Green and Integrated Transport.

The pilot demonstrates how to implement the workflow for ingesting, processing, and storing stream and historical traffic data in a distributed environment. The pilot demonstrates the following workflows:

The map matching of the Floating Car Data (FCD) stream that is generated by the taxi fleet. The FCD data, which represents the position of cabs using latitude and longitude coordinates, must be map matched to the roads on which the cabs are driving, in order to infer the traffic conditions of the roads. The map matching is done through an algorithm using a geographical database and topological rules.

The monitoring of the current traffic conditions, which consumes the mapped FCD data and infers the traffic conditions of the roads.

The forecasting of future traffic conditions, based on a model that is trained from historical and real-time mapped FCD data.

The second pilot is based upon the processing modules developed in the first pilot (cf. D5.2, Section 5), namely the processing modules developed by CERTH to analyze traffic data and classify traffic conditions. The second pilot will also develop the newly added workflow of traffic forecasting and model training, which did not exist during the first pilot cycle.

The data sources available for the pilot are:

A near-real-time stream of Floating Car Data (FCD) generated by a fleet of 1200 taxis, containing information about the position, speed, and direction of the cabs

A historical database of recorded FCD data

A geographical database with information about the road network in Thessaloniki

The results of traffic monitoring and traffic forecasting are saved into a database for querying, statistics, and visualizations.

5.2 Requirements

Table 7 lists the ingestion, storage, processing, and output requirements set by this pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 7 lists only the new requirements.

Table 7: Requirements of the Second SC4 Pilot

Requirement | Comment
R1. The pilot will enable the evaluation of the present and future traffic conditions (e.g. congestion) within temporal windows. | The map-matched FCD data are used to determine the current traffic condition and to make predictions within different time windows.
R2. The traffic predictions will be saved in a database. | Traffic condition and prediction will be used for queries, statistics, evaluation of the quality of predictions, and visualizations.
R3. The pilot can be started in two configurations: single node (for development and testing) and cluster (production). | It must be possible to run all the pilot components in one single node for development and testing purposes. The cluster configuration must provide clustering of all components: messaging system (Kafka), processing modules (Flink, Spark, TensorFlow), and storage (Postgres).

Figure 4: Architecture of the Second SC4 Pilot

5.3 Architecture

The architecture of the pilot has been designed taking into consideration the data sources (mostly streams), the processing steps needed, and the information that needs to be computed. The pilot will ingest data from a near real-time FCD data stream from cabs and from historical FCD data. The FCD data needs to be preprocessed for map matching before being used for classification/prediction.

Apache Kafka will be used to distribute the computations, as it provides a scalable, fault-tolerant messaging system. The processing of the data streams will be performed within temporal windows. Apache Flink will be used for the map matching algorithm, in the same manner as in the first cycle of the pilot. Apache Spark or TensorFlow will be considered as a platform to implement the traffic forecasting algorithm.

The algorithms used for the map matching and classification will be provided using R, as it provides good support for machine learning algorithms and because it is commonly used and well known by the researchers at CERTH. In order to use the R packages in a Flink application developed in Java, the pilot will connect to an R server (via Rserve). Recurrent Neural Networks will be used for the traffic forecasting module.
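As a hedged illustration of the forecasting module, the sketch below trains a small recurrent network on windows of mapped FCD speed values with tf.keras. The window length, features, layer sizes and the placeholder training data are assumptions, not the pilot's tuned model.

```python
# Minimal sketch: a small recurrent network that predicts the next-window
# mean speed from a history of mapped FCD windows. Window length, features,
# hyperparameters and the placeholder data are illustrative assumptions.
import numpy as np
import tensorflow as tf

WINDOW = 12      # assumed: twelve past time windows as input
FEATURES = 1     # assumed: mean speed per window

# Placeholder training data standing in for windows of map-matched FCD.
X = np.random.rand(1000, WINDOW, FEATURES).astype("float32")
y = np.random.rand(1000, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(WINDOW, FEATURES)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=64, validation_split=0.1)

# Forecast the next window for one road segment's recent history.
print("predicted mean speed:", float(model.predict(X[:1])[0, 0]))
```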

The computed traffic conditions and predictions will be stored in a scalable, fault-tolerant database such as Elasticsearch. The storage system must support spatial and temporal indexing.
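The write path towards the store could then resemble the following sketch, which consumes prediction results from Kafka and indexes them in Elasticsearch with a geo_point and a timestamp to cover the spatial and temporal indexing requirement. The topic, index name and document layout are illustrative assumptions.

```python
# Minimal sketch: consume traffic classification/prediction results from Kafka
# and index them in Elasticsearch with spatial and temporal fields.
# Topic, index name and document layout are illustrative assumptions.
import json

from kafka import KafkaConsumer
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://elasticsearch:9200"])
es.indices.create(
    index="traffic",
    ignore=400,        # ignore "index already exists"
    body={"mappings": {"properties": {
        "segment_id": {"type": "keyword"},
        "timestamp": {"type": "date"},
        "location": {"type": "geo_point"},
        "congestion_level": {"type": "integer"},
        "predicted_speed": {"type": "float"},
    }}},
)

consumer = KafkaConsumer(
    "traffic-predictions",                        # assumed topic
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    es.index(index="traffic", body=message.value)
```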

5.4 Deployment

Table 8 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.

Table 8: Components needed to deploy the Second SC4 Pilot

Module | Task | Responsible
PostGIS, Elasticsearch, Kafka, Flink, Spark, TensorFlow | BDI dockers made available by WP4 | NCSR-D, SWC, TF, FhG
A Kafka producer for the FCD data stream (source: URL) and historical data (source: file system) | Develop a Kafka producer to collect the FCD data as a stream from web services and from the file system for the historical data sets, and send them to a Kafka topic | FhG
Kafka brokers | Install Kafka to provide a message broker and the topics | SWC
A Spark application for traffic forecasting and model training | Develop a Spark application that consumes map-matched FCD data from a Kafka topic. The application will train a prediction model and write the traffic predictions to Elasticsearch | FhG
A Kafka consumer for storing analysis results | Develop a Kafka consumer that stores the results of the traffic classification and prediction modules | FhG

6 Second SC5 Pilot Deployment

6.1 Use cases

The pilot is carried out by NCSR-D in the frame of SC5: Climate Action, Environment, Resource Efficiency and Raw Materials.

The pilot demonstrates the following workflow: a (potentially hazardous) substance is released in the atmosphere, which results in increased readings in one or more monitoring stations. The user accesses a user interface provided by the pilot to define the locations of the monitoring stations as well as a timeseries of the measured values (e.g. gamma dose rate). The platform initiates:

a weather matching algorithm, that is, a search for similarity between the current weather and the pre-computed weather patterns, as well as

a dispersion matching algorithm, that is, a search for similarity of the current substance dispersion patterns with the precomputed ones.

The weather patterns have been extracted in a pre-processing step by clustering weather conditions recorded in the past, while the substance dispersion patterns have been precomputed by simulating different scenarios of substance release and weather conditions. The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request.

The following datasets are involved:

NetCDF files from the European Centre for Medium-range Weather Forecasting (ECMWF) [7]

GRIB files from the National Oceanic and Atmospheric Administration (NOAA) [8]

The following processing will be carried out:

The weather clustering algorithm that creates clusters of similar weather conditions, implemented using the BDI platform (see Section 6.3)


The WRF downscaling, which takes as input low-resolution weather data and creates high-resolution weather data

The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model, which computes dispersion patterns given predominant weather conditions

The following outputs are made available for visualization or further processing:

The dispersions produced by DIPCOT

The weather clusters produced by the weather clustering algorithm

7. http://apps.ecmwf.int/datasets
8. https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/global-forcast-system-gfs

6.2 Requirements

Table 9 lists the ingestion, storage, processing, and output requirements set by this pilot.

Table 9: Requirements of the Second SC5 Pilot

Requirement | Comment
R1. Provide a means of downloading current/evaluation weather from ECMWF or alternative services. | A data connector/interface needs to be developed.
R2. ECMWF and NOAA datasets are compatible with the WRF and DIPCOT naming conventions. | A preprocessing WPS normalization step will perform the necessary transformations and variable renamings needed to ensure compatibility.
R3. Retrieve NetCDF files from HDFS as input to the weather clustering algorithm. |
R4. Dispersion matching will filter on dispersion values. | A relational database will provide indexes on dispersion values for efficient dispersion search.
R5. Dispersion visualization. | Weather and dispersion matching must produce output compatible with Sextant's input, or Sextant must be modified to support the new input.

Figure 5: Architecture of the Second SC5 Pilot

6.3 Architecture

To satisfy the requirements described above, the following components will be deployed.

Storage infrastructure

HDFS for storing NetCDF and GRIB files

Postgres for storing dispersions

Processing components

Scikit-learn or TensorFlow to host the weather clustering algorithm (a sketch of the clustering and matching steps is given after this list)

Other modules

ECMWF and NOAA data connectors

WPS normalization procedure

WRF downscaling component

DIPCOT atmospheric dispersion model

Weather and dispersion matching

Sextant for visualizing the dispersion layer
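A hedged sketch of the clustering and matching steps referred to above is given below: it reads a few fields from a NetCDF file with the netCDF4 library, clusters the flattened fields with scikit-learn KMeans, and assigns the current weather to its nearest precomputed pattern. Variable names, file paths and the number of clusters are assumptions for illustration.

```python
# Minimal sketch: cluster historical weather fields and match the current
# weather against the precomputed clusters. Variable names, file paths and
# the number of clusters are illustrative assumptions.
import numpy as np
from netCDF4 import Dataset
from sklearn.cluster import KMeans

def load_fields(path, variables=("u10", "v10", "msl")):
    """Flatten the selected variables of each time step into one feature vector."""
    with Dataset(path) as nc:
        arrays = [np.asarray(nc.variables[v][:]) for v in variables]
    # Each variable has shape (time, lat, lon); concatenate per time step.
    return np.concatenate([a.reshape(a.shape[0], -1) for a in arrays], axis=1)

# Pre-processing: cluster past weather conditions from an assumed archive file.
history = load_fields("weather_archive_2010_2015.nc")
kmeans = KMeans(n_clusters=50, random_state=0).fit(history)

# Runtime: match the latest weather field to the nearest precomputed pattern.
current = load_fields("weather_current.nc")[-1:]
print("closest precomputed weather pattern:", int(kmeans.predict(current)[0]))
```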

D54 ndash v 100

Page

31

6.4 Deployment

Table 10 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.

Table 10: Components needed to deploy the Second SC5 Pilot

Module | Task | Responsible
HDFS, Sextant, Postgres | BDI dockers made available by WP4 | TF, UoA, NCSR-D
Scikit-learn, TensorFlow | To be developed in the pilot | NCSR-D
DIPCOT | To be packaged in the pilot | NCSR-D
Weather clustering algorithm | To be developed in the pilot | NCSR-D
Weather matching | To be developed in the pilot | NCSR-D
Dispersion matching | To be developed in the pilot | NCSR-D
ECMWF and NOAA data connector | To be developed in the pilot | NCSR-D
Data visualization UI | To be developed in the pilot | NCSR-D


7 Second SC6 Pilot Deployment

7.1 Use cases

The pilot is carried out by NCSR-D and SWC in the frame of SC6: Europe in a changing world - inclusive, innovative and reflective societies.

The pilot demonstrates the following workflow: municipality economic data (i.e. budget and budget execution data) are ingested on a regular basis (daily, weekly, and so on) from a series of locations and in a variety of structures and formats, and are homogenized so that they can be compared, analyzed, and visualized in a comprehensible way. The data is exposed to users via a dashboard that exposes search/discovery, aggregation, analysis, correlation, and visualization functionalities over structured data. The results of the data analysis will be stored in the infrastructure to avoid carrying out the same processing multiple times.

The second cycle of the pilot will extend the first pilot by incorporating different formats through a modular parsing library.

The following datasets are involved:

Budget execution data of the Municipality of Athens

Budget execution data of the Municipality of Thessaloniki

Budget execution data of the Municipality of Barcelona

The current datasets involved are exposed either via an API or as CSV/XML files.

Datasets will be described by DCAT-AP [9] metadata and the FIBO [10] and FIGI [11] ontologies. Statistical data will be described in the RDF Data Cube [12] vocabulary.
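To make the data model concrete, the sketch below builds a single RDF Data Cube observation for a budget execution figure using rdflib. The dataset URI and the dimension and measure properties are hypothetical, not the pilot's actual vocabulary.

```python
# Minimal sketch: one RDF Data Cube observation for a budget execution figure.
# Namespaces, dimensions and measures are hypothetical, not the pilot's vocabulary.
from rdflib import Graph, Literal, Namespace, RDF, XSD

QB = Namespace("http://purl.org/linked-data/cube#")
EX = Namespace("http://example.org/budget/")          # hypothetical namespace

g = Graph()
g.bind("qb", QB)
obs = EX["obs/athens-2016-00611"]

g.add((obs, RDF.type, QB.Observation))
g.add((obs, QB.dataSet, EX["dataset/athens-budget-execution"]))
g.add((obs, EX.municipality, EX["municipality/athens"]))                       # dimension
g.add((obs, EX.fiscalYear, Literal("2016", datatype=XSD.gYear)))               # dimension
g.add((obs, EX.budgetLine, Literal("00611")))                                  # dimension
g.add((obs, EX.executedAmount, Literal("1523000.00", datatype=XSD.decimal)))   # measure

turtle = g.serialize(format="turtle")
print(turtle if isinstance(turtle, str) else turtle.decode("utf-8"))
```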

The following processing is carried out:

Data ingestion and homogenization

Aggregation, analysis, and correlation over the budget data

The following outputs are made available for visualization or further processing:

Structured information extracted from budget datasets, exposed as a SPARQL endpoint

Metadata for dataset searching and discovery

Aggregation and analysis results

9. Cf. https://joinup.ec.europa.eu/asset/dcat_application_profile/description
10. Cf. http://www.omg.org/spec/EDMC-FIBO/FND/1.0/Beta1/index.htm
11. Cf. http://www.omg.org/hot-topics/finance.htm
12. Cf. https://www.w3.org/TR/2014/REC-vocab-data-cube-20140116/

7.2 Requirements

Table 11 lists the ingestion, storage, processing, and output requirements set by this pilot.

Table 11: Requirements of the Second SC6 Pilot

Requirement | Comment
R1. In case of failures during data processing, users should be able to recover data produced earlier in the workflow, along with their lineage and associated metadata. | Processing modules should periodically store intermediate results. When starting up, processing modules should check at the metadata registry whether intermediate results are available.
R2. Transform budget data into a homogenized format using various parsers. | Parsers will be developed for the pilot, taking into account R1.
R3. Expose data and metadata through a SPARQL endpoint. | The triple store should be accessed via a SPARQL endpoint.
R4. Intuitive, easy-to-use interface for searching and selecting relevant data sources. The use of the user interface should be documented so that users can ease into using it with as little effort as possible. | The GraphSearch UI will be used to create visualizations from SPARQL queries.

Figure 6: Architecture of the Second SC6 Pilot

7.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures

HDFS for storing ingested datasets

4store for storing homogenized statistical data and dataset metadata

Processing infrastructures

Metadata extraction: Spark is used to extract RDF data and metadata from the budget data. The extraction reacts on Kafka messages.

PoolParty: A SKOS Thesaurus [13] will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g. bio terms, locations and other named entities). For this step the SWC PoolParty Semantic Suite [14] will be used as an external service. PoolParty is accessible from the BDE components via an HTTP API; the connection between Spark and PoolParty has been implemented in the first pilot cycle. Additional enrichment of the dataset will be explored, e.g. via linking to DBpedia or other LOD sources.

Data analysis, which will be performed on demand by pre-defined queries in the dashboard.

13. Cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System
14. Cf. http://www.poolparty.biz

Other modules

Flume for dataset ingestion: for every source that will be ingested into the system there will be a Flume agent responsible for data ingestion and basic modification/unification

Kafka: as soon as a new record is available, a Kafka message will be produced. One Kafka consumer stores the raw data into HDFS.

A set of pre-defined SPARQL queries that carry out analytical aggregations, important comparisons, and/or other analysis of the data (an example query is sketched after this list)

A GUI that provides functionality for (a) metadata searching to discover datasets, data, and publications, and (b) linked data browsing (i.e. dereferencing entity descriptions in RDF), in the form of a visual dashboard realised in d3.js [15]

GraphSearch as the user interface
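As an example of such a pre-defined query, the following sketch runs an aggregation (total executed amount per municipality and year) against a SPARQL endpoint using SPARQLWrapper. The endpoint URL reuses an assumed 4store HTTP front-end, and the property URIs are the hypothetical ones from the data-model sketch above.

```python
# Minimal sketch: a pre-defined aggregation query over the homogenized budget
# data, executed against a SPARQL endpoint. Endpoint URL and property URIs
# are illustrative assumptions.
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX ex: <http://example.org/budget/>
SELECT ?municipality ?year (SUM(?amount) AS ?total)
WHERE {
  ?obs a qb:Observation ;
       ex:municipality   ?municipality ;
       ex:fiscalYear     ?year ;
       ex:executedAmount ?amount .
}
GROUP BY ?municipality ?year
ORDER BY ?municipality ?year
"""

endpoint = SPARQLWrapper("http://4store:8080/sparql/")   # assumed endpoint URL
endpoint.setQuery(QUERY)
endpoint.setReturnFormat(JSON)
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["municipality"]["value"], row["year"]["value"], row["total"]["value"])
```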

7.4 Deployment

Table 12 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.

Table 12: Components needed to deploy the Second SC6 Pilot

Module | Task | Responsible
Spark over HDFS, 4store, Flume, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, SWC
Data storage schema | To be extended for the pilot | SWC
Metadata extraction | Parsers for different data sources will be developed for the pilot | SWC
GraphSearch GUI | To be configured for the pilot | SWC

15. Cf. https://d3js.org

8 Second SC7 Pilot Deployment

8.1 Use cases

The pilot is carried out by SatCen, UoA, and NCSR-D in the frame of SC7: Secure societies - protecting freedom and security of Europe and its citizens.

The pilot demonstrates the following workflows:

1. Event detection workflow: news sites and social media are monitored and processed in order to extract and localize information about events. Events are categorized and the information from them is extracted; the end-user is notified about the area referred to by the news and can visualize the event information together with the changes detected by the other workflow (if activated).

2. Change detection workflow: the end user selects a relevant Area of Interest. With respect to the selected dates, two satellite images (earliest and latest) of these areas are downloaded from the ESA Sentinels Scientific Data Hub and processed in order to detect changes. The end-user is notified about detected changes and can view the images and event information about this area.

The second cycle of the SC7 pilot will extend the functionality and improve the performance of the first cycle of the pilot (cf. D5.2, Section 8).

Apart from the datasets used in the first cycle of the pilot, this cycle will also use the keyword-based Twitter API to retrieve tweets based on pre-defined keywords. To further support the keyword-based search, the second cycle of the pilot will also include a full-text indexing engine.

The following outputs are made available for visualization or further processing:

Relevant news related to specific keywords, together with the corresponding Area of Interest

Detected changes

Moreover, the event detection workflow will be extended in order to automatically activate the change detection workflow. These changes are depicted in the updated architecture diagram in Figure 7.


8.2 Requirements

Table 13 lists the ingestion, storage, processing, and output requirements set by the second cycle of the pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 13 lists only the new requirements.

Table 13: Requirements of the Second SC7 Pilot

Requirement | Comment
R1. Monitor keyword-based text services (Twitter). Text is retrieved and stored together with provenance and any metadata provided by the service (notably location). | The NOMAD data connectors to Twitter and Reuters will be adapted to access the keyword search API of Twitter and store to Cassandra.
R2. Regularly execute event detection using Spark over the most recent text batch. | Event detection is part of the ingestion process and adds annotations to the text data; it is not part of the distributed processing.
R3. Improve the speed of the change detection workflow. | Optimize the scalability of the operators developed in Apache Spark for the change detection workflow.
R4. Extend the change detection workflow to improve accuracy. | Fundamental SNAP operators (e.g. Subset and Terrain Correction) for Sentinel-1 will be adapted to Apache Spark.
R5. Areas of Interest are automatically defined by event detection. | The Sentinel data connector is parametrized from the event detection module with a GIS shape.
R6. The end-user interface is based on Sextant. | Improvement of Sextant functionalities to improve the user experience.
R7. Users must be authenticated and authorized to access the pilot data. | Sextant will be extended in order to support authentication and authorization.

Figure 7: Architecture of the Second SC7 Pilot

8.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures

HDFS for storing satellite images

Cassandra for storing news and tweet content and metadata

Lucene for storing the GADM dataset, i.e. the administrative areas together with their geo-locations

Strabon for storing geo-locations of detected changes and location metadata about news and tweets

Processing infrastructures

Spark will be made available for improving the change detection module and developing the event detection module

Data integration

Semagrow will federate Strabon and Cassandra to provide the user interface with homogeneous access to both data stores

Other modules

Twitter data connector (a sketch of the keyword-based connector is given after this list)

Reuters RSS feed reader

The Sentinel Data Aggregator, which receives as input the set of areas of interest and submits a suitable query to the Sentinels Scientific Data Hub

Sextant as the user interface
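A hedged sketch of the keyword-based Twitter connector is shown below, using the tweepy search API and the DataStax Python driver for Cassandra. Credentials, keyspace, table layout and keywords are illustrative assumptions; the actual connector is an adaptation of the NOMAD connectors.

```python
# Minimal sketch: retrieve tweets for pre-defined keywords and store text plus
# provenance metadata in Cassandra. Credentials, keyspace, table layout and
# keywords are illustrative assumptions.
import tweepy
from cassandra.cluster import Cluster

KEYWORDS = ["flood", "earthquake", "wildfire"]            # assumed keywords

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

session = Cluster(["cassandra"]).connect("sc7")           # assumed keyspace
insert = session.prepare(
    "INSERT INTO tweets (id, keyword, text, created_at, author, lat, lon) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)"
)

for keyword in KEYWORDS:
    for tweet in api.search(q=keyword, count=100):
        if tweet.coordinates:                              # GeoJSON order: [lon, lat]
            lon, lat = tweet.coordinates["coordinates"]
        else:
            lon, lat = None, None
        session.execute(insert, (
            tweet.id, keyword, tweet.text, tweet.created_at,
            tweet.user.screen_name, lat, lon,
        ))
```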

8.4 Deployment

Table 14 lists the components provided to the pilot as part of BDI [16] and components that will be developed within WP6 in the context of executing the pilot.

Table 14: Components needed to deploy the Second SC7 Pilot

Module | Task | Responsible
Big Data Integrator (HDFS/Hadoop, Cassandra, Spark, Semagrow, Strabon, SOLR) | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, UoA, SWC
Cassandra and Strabon stores | The schema needs to be altered to support tweets by keyword | NCSR-D and UoA
Change detection module | Spark code to be developed for extending and improving the change detection algorithm | UoA
Event detection module | Spark code to be developed to scale the event detection algorithm | NCSR-D
Twitter data connector | To be extended to access the keyword search Twitter API | NCSR-D
User interface | To be enhanced for the pilot | UoA

16. Cf. https://github.com/big-data-europe/README/wiki/Components


9 Conclusions

This report analysed the pilot requirements and specified the components of the generic Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting round. The relevant work in this task is to ensure that the components are within the scope of what is prepared in WP4, and that they interoperate and can be used in the same application.

All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure and provided to the piloting partners as a basis for their piloting applications, which will be developed in WP6. As a result of this preliminary testing and of the interaction between the technical partners and the piloting partners, some of the original pilot descriptions have been refined and fully specified, and their usage of BDI components has been clarified. This ensures that the pilot descriptions are consistent with the first public release of the BDI platform (D4.2) and can be reproduced by interested third parties.

Work in this task (Task 5.2) will proceed as follows:

During the second pilot deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.

During the third pilot deployment phase, work in this task will prepare the next version of this document regarding the BDI instances needed for the third piloting round.



3 Second SC2 Pilot Deployment

31 Overview

The pilot is carried out by AK FAO and SWC in the frame of SC2 Food Security Sustainable

Agriculture and Forestry Marine Maritime and Inland Water Research and the Bioeconomy

The second pilot cycle builds upon the first pilot cycle (cf D51 Section 3) expanding the

relevant data sources and extending the data processing needed to handle a variety of data

types (apart from bibliographic data) relevant to Viticulture

The pilot demonstrates the following workflows

1 Text mining workflow Automatically annotating scientific publications by (a) extracting

named entities (locations domain terms) and (b) extracting the captions of images

figures and tables The extracted information is provided to viticultural researchers via

a GUI that exposes search functionality

2 Data processing workflow The end users (viticultural researchers) upload scientific

data in a variety of formats and provide the metadata needed in order to correctly

interpret the data The data is ingested and homogenized so that it can be compared

and connected with other relevant data originally in diverse formats The data is

exposed to viticultural researchers via a GUI that exposes searchdiscovery

aggregation analysis correlation and visualization functionalities over structured data

The results of the data analysis will be stored in the infrastructure to avoid carrying out

the same processing multiple times with appropriate provence for future reference

publication and scientific replication

3 Phenologic modeling workflow that is the scheduling of agricultural operations (eg

pruning harvesting etc) by cross-examining the weather data observed in the area of

the vineyard with the appropriate weather conditions needed for the aforementioned

operations

4 Variety identification workflow The end users complete an on-spot questionnaire

regarding the characteristics of a specific grape variety Together with the geolocation

of the questionnaire this information is used to identify a grape variety

The following datasets will be involved

The AGRIS and PubMed datasets that include scientific publications

Weather data available via publicly-available API such as AccuWeather

OpenWeatherMap Weather Underground

D54 ndash v 100

Page

15

User-generated data such as geotagged photos from leaves young shoots and grape

clusters ampelographic data SSR-marker data that will be provided by the VITIS

application

OIV Descriptor List2 for Grape Varieties and Vitis species

Crop Ontology

The following processing is carried out

Named entity extraction

Researcher affiliation extraction and verification

Variety identification

Phenologic modelling

PDF structure processing to associate tables and diagrams with captions

Data ingestion and homogenization

Aggregation analysis correlation over scientific data

The following outputs are made available for visualization or further processing

Structured information topics extracted from scientific publications

Metadata for dataset searching and discovery

Aggregation analysis correlation results

32 Requirements

Table 3 lists the ingestion storage processing and output requirements set by this pilot

Table 3 Requirements of the Second SC2 Pilot

Requirement Comment

R1 In case of failures during data

processing users should be able to

recover data produced earlier in the

workflow along with their lineage and

associated metadata

Processing modules should periodically

store intermediate results and their lineage

metadata When starting up processing

modules should check at the metadata

registry if intermediate results are available

R2 Extracting images and their captions

from scientific publications

To be developed for the pilot taking into

account R1

2 httpwwwoivinten

D54 ndash v 100

Page

16

R3 Extracting thematic annotations from

text in scientific publications

To be developed for the pilot taking into

account R1

R4 Extracting researcher affiliations from

the scientific publications

To be developed for the pilot taking into

account R1

R5 Variety identification To be developed for the pilot taking into

account R1

R6 Phenolic modeling To be developed for the pilot taking into

account R1

R5 Expose data and metadata in JSON

through a Web API

Data ingestion module should write JSON

documents in HDFS 4store should be

accessed via a SPARQL endpoint that

responds with results in JSON

Table 3 Requirements of the Second SC2 Pilot

D54 ndash v 100

Page

17

Figure 2 Architecture of the Second SC2 Pilot

Figure 2 Architecture of the Second SC2 Pilot

33 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing publication full-text and ingested datasets

A graph database for storing publication metadata (terms and named entities)

affiliation metadata (connections between researchers) weather metadata and VITIS

metadata

Processing infrastructures

Metadata extraction Spark or UnifiedViews3 are used to extract RDF metadata from

publication full-text These tools will react on Kafka messages Spark and UnifiedViews

will be evaluated for this task

3 Cf httpwwwunifiedviewseu

D54 ndash v 100

Page

18

PoolParty A SKOS Thesaurus4 will be used to consolidatetranslate (linkmap) the

terms in the ingested documents (eg bio terms locations and other named entities)

For this step the SWC PoolParty Semantic Suite5 will be used Additional enrichment

of the dataset will be explored eg via linking to DBpedia or other LOD sources

AKSTEM the process of discovering relations and associations between organizations

and people in the field of viticulture research

Phenolic Modeling algorithm already developed in AK VITIS will be adapted to work in

the context of an Apache Spark application

Variety Identification already developed in AK VITIS will be adapted to work in the

context of an Apache Spark application

Extraction of images and figures and their captions from publication PDFs

Data analysis which writes analysis results back into the infrastructure to be retrieved

for visualization Data analysis should accompany each write-back with appropriate

metadata that specify the processing lineage of the derived dataset Intermediate

results should also be written out (and described as such in the metadata) in order to

allow resuming processing after a failure

Other modules

Flume for publication ingestion For every source that will be ingested into the system

there will be a flume agent responsible for data ingestion and basic

modificationunification

Kafka as soon as a new record is available a Kafka message will be produced One

kafka consumer stores raw data into HDFS

34 Deployment

Table 4 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 4 Components needed to deploy the Second SC2 Pilot

Module Task Responsible

Spark over HDFS Flume

Kafka

BDI dockers made available by WP4 FH TF InfAI

SWC

4 Cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 5 Cf httpwwwpoolpartybiz

D54 ndash v 100

Page

19

GraphDB andor Neo4j

dockerization

To be investigated if the Docker

images provided by the official

systems6 are suitable for the pilot If

not will be altered for the pilot or use

an already dockerized triple store such

as Virtuoso or 4store

SWC

Flume agents for publication

ingestion and processing

To be developed for the pilot SWC

Flume agents for data

ingestion

To be extended for the pilot in order to

support the introduced datasets

(accuweather data user-generated

data)

SWC AK

Data storage schema To be developed for the pilot SWC AK

Phenolic modelling To be adapted from AK VITIS for the

pilot

AK

Spark AKSTEM To be adapted from AK STEM for the

pilot

AK

Variety Identification To be adapted from AK VITIS for the

pilot

AK

Table 4 Components needed to deploy the Second SC2 Pilot

6 httpsneo4jcomdeveloperdocker

D54 ndash v 100

Page

20

4 Second SC3 Pilot Deployment

41 Overview

The pilot is carried out by CRES in the frame of SC3 Secure Clean and Efficient Energy

The second pilot cycle extends the first pilot by adding additional online and offline data

analysis on raw data regarding Acoustic Emissions (AE) sensors and aggregated data such

as parametrics from continuous monitoring systems (CMS) The pilot demonstrates the

following workflow a developer in the field of wind energy enhances condition monitoring for

each unit in a wind farm by pooling together data from multiple units from the same farm (to

consider the cluster operation in total) and third party data (to perform correlated assessment)

The custom analysis modules created by the developer use both raw data that are transferred

offline to the processing cluster and condensed data streamed online at the same time order

that the event occurs

The following datasets are involved

Raw sensor and SCADA data from a given wind farm

Online stream data comprised of parametrics and statistics extracted from the raw

SCADA data

Raw sensor data from Acoustic Emissions module from a given wind farm

All data is in custom binary or ASCII formats ASCII files contain a metadata header and in

tabulated form the signal data (signal in columns time sequence in rows) All data is annotated

by location time and system id

The following processing is carried out

Near-real time execution of parametrized models to return operational statistics

warnings including correlation analysis of data across units

Weekly execution of operational statistics

Weekly execution of model parametrization

Weekly specific acoustic emissions DSP

The following outputs are made available for visualization or further processing

Operational statistics near-real time and weekly

Model parameters

D54 ndash v 100

Page

21

42 Requirements

Table 5 lists the ingestion storage processing and output requirements set by this pilot Since

the second cycle of the pilot extends the first pilot some requirements are identical and

therefore omitted from Table 5

Table 5 Requirements of Second SC3 Pilot

Requirement Comment

R1 The online data will be sent (via

OPC) from the intermediate

(local) processing level to BDI

A data connector must be developed that provides

for receiving OPC streams from an OPC-

compatible server

R2 The application should be able

to recover from short outages by

collecting the data transmitted

during the outage from the data

sources

An OPC data connector must be developed that

can retrieve the missing data collected at the

intermediate level from the distributed data

historian systems

R3 Near-realtime execution of

parametrized models to return

operational statistics including

correlation analysis of data

across units

The analysis software should write its results back

into a specified format and data model that is

appropriate input for further analysis

R4 The GUI supports database

querying and data visualization

for the analytics results

The GUI will be able to access files in the format

and data model

Table 5 Requirements of the Second SC3 Pilot

D54 ndash v 100

Page

22

Figure 3 Architecture of the Second SC3 Pilot

Figure 3 Architecture of the Second SC3 Pilot

43 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS that stores binary blobs each holding a temporal slice of the complete data The

slicing parameters are fixed and can be applied at data ingestion time

A Postgres relational database to store the warnings operational statistics and the

output of the analysis The schema will be defined in a later

A Kafka broker that will distribute the continuous stream of CMS to model execution

Processing infrastructures

D54 ndash v 100

Page

23

A processor that operates upon temporal slices of data

A Spark module that orchestrates the application of the processor on slices

A Spark streaming module that operates on the online data

Other modules

A data connector that offers an ingestion endpoint andor can retrieve from remote data

sources using the FTP protocol

A data connector that offers an ingestion endpoint that can retrieve an online stream

using OPC protocol and publish it to a Kafka topic

Data visualization that can visualize the data files stored in HDFS

44 Deployment

Table 6 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 6 Components needed to deploy the Second SC3 Pilot

Module Task Responsible

Spark HDFS Postgres

Kafka

BDI dockers made available by WP4 FH TF InfAI

NCSR-D SWC

Acoustic Emissions DSP To be developed for the pilot CRES

OPC Data connector To be developed for the pilot CRES

Data visualization To be extended for the pilot CRES

Table 6 Components needed to deploy the Second SC3 Pilot

D54 ndash v 100

Page

24

5 Second SC4 Pilot Deployment

51 Use cases

The pilot is carried out by FhG and CERTH in the frame of SC4 Smart Green and Integrated

Transport

The pilot demonstrates how to implement the workflow for ingesting processing and storing

stream and historical traffic data in a distributed environment The pilot demonstrates the

following workflows

The map matching of the Floating Car Data (FCD) stream that is generated by the taxi

fleet The FCD data that represents the position of cabs using latitude and longitude

coordinates must be map matched to the roads on which the cabs are driving in order

to infer the traffic conditions of the roads The map matching is done through an

algorithm using a geographical database and topological rules

The monitoring of the current traffic conditions that consumes the mapped FCD data

and infers the traffic conditions of the roads

The forecasting of future traffic conditions based on a model that is trained from

historical and real-time mapped FCD data

The second pilot is based upon the processing modules developed in the first pilot (cf D52

Section 5) namely the processing modules developed by CERTH to analyze traffic data

classify traffic conditions The second pilot will also develop the newly added workflow of the

traffic forecasting and model training that did not exist during the first pilot cycle

The data sources available for the pilot are

A near-real time stream Floating Car Data (FCD) generated by a fleet of 1200 taxis

containing information about the position speed and direction of the cabs

A historical database of recorded FCD data

A geographical database with information about the road network in Thessaloniki

The results of traffic monitoring and traffic forecasting are saved into a database for querying

statistics and visualizations

52 Requirements

Table 7 lists the ingestion storage processing and output requirements set by this pilot Since

the present pilot cycle is an extension of the first pilot the requirements of the first pilot also

apply Table 13 lists only the new requirements

D54 ndash v 100

Page

25

Table 7 Requirements of the Second SC4 Pilot

Requirement Comment

R1 The pilot will enable the

evaluation of the present and

future traffic conditions (eg

congestion) within temporal

windows

The FCD map matched data are used to determine

the current traffic condition and to make predictions

within different time windows

R2 The traffic predictions will be

saved in a database

Traffic condition and prediction will be used for

queries statistics evaluation of the quality of

predictions visualizations

R3 The pilot can be started in two

configurations single node (for

development and testing) and

cluster (production)

It must be possible to run all the pilot components

in one single node for development and testing

purposes The cluster configuration must provide

cluster of any components messaging system

(Kafka) processing modules (Flink Spark

TensorFlow) storage (Postgres)

Table 7 Requirements of the Second SC4 Pilot

D54 ndash v 100

Page

26

Figure 4 Architecture of the Second SC4 Pilot

Figure 4 Architecture of the Second SC4 Pilot

53 Architecture

The architecture of the pilot has been designed taking into consideration the data sources

mostly streams the processing steps needed and the information that needs to be computed

The pilot will ingest data from a near real-time FCD data stream from cabs and from historical

FCD data The FCD data needs to be preprocessed for map matching before being used for

classificationprediction

Apache Kafka will be used to distribute the computations as it provides a scalable fault

tolerant messaging system The processing of the data streams will be performed within

temporal windows Apache Flink will be used for the map matching algorithm in the same

manner as in the first cycle of the pilot Apache Spark or Tensorflow will be considered as a

platform to implement the traffic forecasting algorithm

The algorithms used for the map matching and classification will be provided using R as

it provides a good support for machine learning algorithms and because it is commonly used

D54 ndash v 100

Page

27

and well known by researchers at CERTH In order to use the R packages in a Flink application

developed in Java the pilot will connect to R server (via Rserve) Recurrent Neural Networks

will be used for the traffic forecasting module

The traffic conditions and prediction computation will be stored in a scalable fault tolerant

database such as Elasticsearch The storage system must support spatial and temporal

indexing

54 Deployment

Table 8 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 8 Components needed to deploy Second SC4 Pilot

Module Task Responsible

PostGIS Elasticsearch

Kafka Flink Spark

TensorFlow

BDI dockers made available by WP4 NCSR-D SWC

TF FhG

A Kafka producer for FCD

data stream (source URL)

and historical data (source

file system)

Develop a Kafka producer to collect

the FCD data as a stream from web

services and from the file system for

the historical data sets and send them

to a Kafka topic

FhG

Kafka brokers Install Kafka to provide a message

broker and the topics

SWC

A Spark application for traffic

forecasting and model

training

Develop a Spark application that

consumes FCD matched data from a

Kafka topic The application will train a

prediction model and write the traffic

predictions to ElasticSearch

FhG

D54 ndash v 100

Page

28

A Kafka consumer for storing

analysis results

Develop a Kafka consumer that stores

the result of the Traffic Classification

and prediction module

FhG

Table 8 Components needed to deploy the Second SC4 Pilot

6 Second SC5 Pilot Deployment

61 Use cases

The pilot is carried out by NCSR-D in the frame of SC5 Climate Action Environment Resource

Efficiency and Raw Materials

The pilot demonstrates the following workflow A (potentially hazardous) substance is released

in the atmosphere that results to increased readings in one or more monitoring stations The

user accesses a user interface provided by the pilot to define the locations of the monitoring

stations as well as a timeseries of the measured values (eg gamma dose rate) The platform

initiates

a weather matching algorithm that is a search for similarity of the current weather and

the pre-computed weather patterns as well as

a dispersion matching algorithm that is a search for similarity of the current substance

dispersion patterns with the precomputed ones

The weather patterns have been extracted in a pre-processing step by clustering weather

conditions recorded in the past while the substance dispersion patterns have been

precomputed by simulating different scenarios of substance release and weather conditions

The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request

The following datasets are involved

NetCDF files from the European Centre for Medium range Weather Forecasting

(ECMWF7)

GRIB files from National Oceanic and Atmospheric Administration (NOAA8)

The following processing will be carried out

The weather clustering algorithm that creates clusters of similar weather conditions

implemented using the BDI platform (see Section 63)

7 httpappsecmwfintdatasets 8 httpswwwncdcnoaagovdata-accessmodel-datamodel-datasetsglobal-forcast-system-gfs

D54 ndash v 100

Page

29

The WRF downscaling that takes as input a low resolution weather and creates a high

resolution weather

The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model

computes dispersion patterns given predominant weather conditions

The following outputs are made available for visualization or further processing

The dispersions produced by DIPCOT

The weather clusters produced by the weather clustering algorithm

62 Requirements

Table 9 lists the ingestion, storage, processing, and output requirements set by this pilot.

Table 9 Requirements of the Second SC5 Pilot

Requirement | Comment
R1 Provide a means of downloading current/evaluation weather from ECMWF or alternative services | Data connector/interface needs to be developed
R2 ECMWF and NOAA datasets are compatible with the WRF and DIPCOT naming conventions | A preprocessing WPS normalization step will perform the necessary transformations and variable renamings needed to ensure compatibility
R3 Retrieve NetCDF files from HDFS as input to the weather clustering algorithm | (none)
R4 Dispersion matching will filter on dispersion values | A relational database will provide indexes on dispersion values for efficient dispersion search
R5 Dispersion visualization | Weather and dispersion matching must produce output compatible with Sextant's input, or Sextant must be modified to support new input
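To make R4 concrete, the following minimal sketch (Python, psycopg2) shows how dispersion matching could filter pre-computed dispersions on indexed value columns in Postgres; the table and column names (dispersions, station_id, value) are hypothetical and only illustrate the kind of indexed filtering the requirement calls for.

```python
# Minimal sketch: filter pre-computed dispersions by value range at given stations.
# Table and column names are hypothetical illustrations of requirement R4.
import psycopg2

conn = psycopg2.connect(host="postgres", dbname="sc5", user="pilot", password="pilot")

def matching_dispersions(station_id: int, low: float, high: float):
    """Return pre-computed dispersion scenarios whose value at the given monitoring
    station falls inside [low, high]; relies on an index over (station_id, value)."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT scenario_id, value
            FROM dispersions
            WHERE station_id = %s AND value BETWEEN %s AND %s
            ORDER BY value DESC
            """,
            (station_id, low, high),
        )
        return cur.fetchall()

print(matching_dispersions(station_id=42, low=0.8, high=1.2))
```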


Figure 5 Architecture of the Second SC5 Pilot

6.3 Architecture

To satisfy the requirements described above, the following components will be deployed:

Storage infrastructure

HDFS for storing NetCDF and GRIB files

Postgres for storing dispersions

Processing components

Scikit-learn or TensorFlow to host the weather clustering algorithm (see the sketch after this list)

Other modules

ECMWF and NOAA data connectors

WPS normalization procedure

WRF downscaling component

DIPCOT atmospheric dispersion model

Weather and dispersion matching

Sextant for visualizing the dispersion layer
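As a rough illustration of how the weather clustering component could be hosted on scikit-learn (one of the two options listed above), the sketch below clusters weather fields read from NetCDF files into a fixed number of patterns; the file layout, variable name and number of clusters are assumptions for illustration only.

```python
# Minimal sketch: cluster historical weather fields into recurring patterns.
# Variable name, file list and cluster count are illustrative assumptions.
import glob

import numpy as np
from netCDF4 import Dataset          # reads the NetCDF files stored for the pilot
from sklearn.cluster import KMeans

def weather_vector(path: str) -> np.ndarray:
    """Flatten one weather field (e.g. mean sea-level pressure) into a feature vector."""
    with Dataset(path) as nc:
        field = nc.variables["msl"][0, :, :]      # assumed variable / time slice
        return np.asarray(field).ravel()

files = sorted(glob.glob("/data/weather/*.nc"))   # assumed local copy of the archive
samples = np.stack([weather_vector(f) for f in files])

kmeans = KMeans(n_clusters=8, random_state=0).fit(samples)

# The centroids are the pre-computed weather patterns; kmeans.predict() can later
# assign the current weather to its closest pattern (the weather matching step).
np.save("/data/weather/patterns.npy", kmeans.cluster_centers_)
```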


6.4 Deployment

Table 10 lists the components provided to the pilot as part of BDI and the components that will be developed within WP6 in the context of executing the pilot.

Table 10 Components needed to deploy the Second SC5 Pilot

Module | Task | Responsible
HDFS, Sextant, Postgres | BDI dockers made available by WP4 | TF, UoA, NCSR-D
Scikit-learn, TensorFlow | To be developed in the pilot | NCSR-D
DIPCOT | To be packaged in the pilot | NCSR-D
Weather clustering algorithm | To be developed in the pilot | NCSR-D
Weather matching | To be developed in the pilot | NCSR-D
Dispersion matching | To be developed in the pilot | NCSR-D
ECMWF and NOAA data connector | To be developed in the pilot | NCSR-D
Data visualization UI | To be developed in the pilot | NCSR-D


7 Second SC6 Pilot Deployment

7.1 Use cases

The pilot is carried out by NCSR-D and SWC in the frame of SC6: Europe in a changing world – inclusive, innovative and reflective societies.

The pilot demonstrates the following workflow. Municipality economic data (i.e. budget and budget execution data) are ingested on a regular basis (daily, weekly, and so on) from a series of locations, in a variety of structures and formats, and are homogenized so that they can be compared, analyzed, and visualized in a comprehensible way. The data is exposed to users via a dashboard that exposes search/discovery, aggregation, analysis, correlation, and visualization functionalities over structured data. The results of the data analysis will be stored in the infrastructure to avoid carrying out the same processing multiple times.

The second cycle of the pilot will extend the first pilot by incorporating different data formats, by developing a modular parsing library.
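A minimal sketch of what such a modular parsing library could look like, with one parser per source format registered behind a common interface; the format keys, record fields and sample data are hypothetical and only illustrate the modularity, not the pilot's actual parser design.

```python
# Minimal sketch of a modular parsing library: one parser per budget data format,
# all normalizing to the same record structure (format keys and fields are assumed).
import csv
import io
import xml.etree.ElementTree as ET

PARSERS = {}

def parser(fmt):
    """Register a parser function for a given source format."""
    def register(func):
        PARSERS[fmt] = func
        return func
    return register

@parser("csv")
def parse_csv(raw: bytes):
    for row in csv.DictReader(io.StringIO(raw.decode("utf-8"))):
        yield {"code": row["code"], "label": row["label"], "amount": float(row["amount"])}

@parser("xml")
def parse_xml(raw: bytes):
    for item in ET.fromstring(raw).iter("budgetLine"):   # assumed element name
        yield {"code": item.get("code"),
               "label": item.findtext("label"),
               "amount": float(item.findtext("amount"))}

def homogenize(raw: bytes, fmt: str):
    """Dispatch to the parser registered for this source format."""
    return list(PARSERS[fmt](raw))

sample = b"code,label,amount\n00.6051,Salaries,1200000.50\n"
print(homogenize(sample, "csv"))
```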

The following datasets are involved

Budget execution data of Municipality of Athens

Budget execution data of Municipality of Thessaloniki

Budget execution data of Municipality of Barcelona

The datasets currently involved are exposed either as an API or as CSV/XML files. Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies. Statistical data will be described using the RDF Data Cube12 vocabulary.

The following processing is carried out

Data ingestion and homogenization

Aggregation, analysis, and correlation over scientific data

The following outputs are made available for visualization or further processing

Structured information extracted from budget datasets exposed as a SPARQL endpoint

Metadata for dataset searching and discovery

9 Cf. https://joinup.ec.europa.eu/asset/dcat_application_profile/description 10 Cf. http://www.omg.org/spec/EDMC-FIBO/FND/1.0/Beta1/index.htm 11 Cf. http://www.omg.org/hot-topics/finance.htm 12 Cf. https://www.w3.org/TR/2014/REC-vocab-data-cube-20140116/


Aggregation and analysis

7.2 Requirements

Table 11 lists the ingestion, storage, processing, and output requirements set by this pilot.

Table 11 Requirements of the Second SC6 Pilot

Requirement | Comment
R1 In case of failures during data processing, users should be able to recover data produced earlier in the workflow, along with their lineage and associated metadata | Processing modules should periodically store intermediate results. When starting up, processing modules should check at the metadata registry if intermediate results are available
R2 Transform budget data into a homogenized format using various parsers | Parsers will be developed for the pilot, taking into account R1
R3 Expose data and metadata through a SPARQL endpoint | The triple store should be accessed via a SPARQL endpoint
R4 Intuitive, easy-to-use interface for searching and selecting relevant data sources. The use of the user interface should be documented so that users can ease into using it with as little effort as possible | The GraphSearch UI will be used to create visualizations from SPARQL queries
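To illustrate how a processing module could satisfy R1, the sketch below writes an intermediate result to HDFS together with a small lineage record, and checks for it on start-up; the HDFS paths, WebHDFS address and metadata fields are assumptions, not the pilot's actual layout.

```python
# Minimal sketch: store an intermediate result plus lineage metadata in HDFS (R1).
# Paths, namenode address and metadata fields are illustrative assumptions.
import json
from datetime import datetime, timezone

from hdfs import InsecureClient   # Python 'hdfs' package (WebHDFS client)

client = InsecureClient("http://namenode:50070", user="pilot")

def save_intermediate(step, payload, inputs):
    """Write the intermediate result of a processing step and record its lineage."""
    data_path = "/sc6/intermediate/{}.json".format(step)
    meta_path = "/sc6/intermediate/{}.lineage.json".format(step)
    client.write(data_path, payload, overwrite=True)
    lineage = {"step": step, "inputs": inputs,
               "produced": datetime.now(timezone.utc).isoformat()}
    client.write(meta_path, json.dumps(lineage).encode("utf-8"), overwrite=True)

def recover_intermediate(step):
    """On start-up, return the stored result if this step already ran, else None."""
    data_path = "/sc6/intermediate/{}.json".format(step)
    if client.status(data_path, strict=False) is None:
        return None
    with client.read(data_path) as reader:
        return reader.read()

save_intermediate("homogenize-athens-2016",
                  b'[{"code": "00.6051", "amount": 1200000.5}]',
                  inputs=["/sc6/raw/athens-2016.csv"])
print(recover_intermediate("homogenize-athens-2016"))
```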


Figure 6 Architecture of the Second SC6 Pilot

7.3 Architecture

To satisfy the requirements above, the following modules will be deployed:

Storage infrastructures

HDFS for storing ingested datasets

4store for storing homogenized statistical data and dataset metadata

Processing infrastructures

Metadata extraction: Spark is used to extract RDF data and metadata from the budget data. These tools will react to Kafka messages.

PoolParty: A SKOS Thesaurus13 will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g. bio terms, locations, and other named entities). For this step the SWC PoolParty Semantic Suite14 will be used as an external service.

13 Please cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System 14 Please cf. http://www.poolparty.biz


PoolParty is accessible from the BDE components via an HTTP API. The connection between Spark and PoolParty has been implemented in the first pilot cycle. Additional enrichment of the dataset will be explored, e.g. via linking to DBpedia or other LOD sources.

Data analysis that will be performed on demand by pre-defined queries in the dashboard

Other modules

Flume for dataset ingestion: for every source that will be ingested into the system, there will be a Flume agent responsible for data ingestion and basic modification/unification

Kafka: as soon as a new record is available, a Kafka message will be produced. One Kafka consumer stores raw data into HDFS

A set of pre-defined SPARQL queries that carry out analytical aggregations, important comparisons, and/or other analysis of the data (see the sketch after this list)

A GUI that provides functionality for (a) metadata searching to discover datasets, data and publications, and (b) linked data browsing (i.e. dereferencing entity descriptions in RDF), in the form of a visual dashboard realised in d3.js15

GraphSearch as the user interface
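As an illustration of the pre-defined SPARQL queries mentioned above, the following minimal sketch (Python, SPARQLWrapper) runs an aggregation over budget observations modelled with the RDF Data Cube vocabulary; the endpoint URL, graph layout and property names are assumptions for illustration, not the pilot's actual schema.

```python
# Minimal sketch: a pre-defined aggregation query over Data Cube budget observations.
# Endpoint URL and the ex: property names are illustrative assumptions.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://4store:8080/sparql/"   # assumed SPARQL endpoint of the triple store

QUERY = """
PREFIX qb:  <http://purl.org/linked-data/cube#>
PREFIX ex:  <http://example.org/budget#>

SELECT ?municipality (SUM(?amount) AS ?total)
WHERE {
  ?obs a qb:Observation ;
       ex:municipality ?municipality ;
       ex:amount ?amount .
}
GROUP BY ?municipality
ORDER BY DESC(?total)
"""

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["municipality"]["value"], row["total"]["value"])
```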

7.4 Deployment

Table 12 lists the components provided to the pilot as part of BDI and the components that will be developed within WP6 in the context of executing the pilot.

Table 12 Components needed to deploy the Second SC6 Pilot

Module | Task | Responsible
Spark over HDFS, 4store, Flume, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, SWC
Data storage schema | To be extended for the pilot | SWC
Metadata extraction | Parsers for different data sources will be developed for the pilot | SWC
GraphSearch GUI | To be configured for the pilot | SWC

15 Cf. https://d3js.org

8 Second SC7 Pilot Deployment

8.1 Use cases

The pilot is carried out by SatCen, UoA, and NCSR-D in the frame of SC7: Secure societies – Protecting freedom and security of Europe and its citizens.

The pilot demonstrates the following workflows

1. Event detection workflow: News sites and social media are monitored and processed in order to extract and localize information about events. Events are categorized and the information in them is extracted; the end-user is notified about the area concerned by the news and can visualize the event information together with the changes detected by the other workflow (if activated).

2. Change detection workflow: The end user selects a relevant Area of Interest. With respect to the selected dates, two satellite images (earliest and latest) of these areas are downloaded from the ESA Sentinels Scientific Data Hub and processed in order to detect changes. The end-user is notified about detected changes and can view the images and event information about this area.

The second cycle of the SC7 pilot will extend the functionality and improve the performance of the first cycle of the pilot (cf. D5.2, Section 8).

Apart from the datasets used in the first cycle of the pilot, this cycle will also use the keyword-based Twitter API to retrieve tweets based on pre-defined keywords. To further support the keyword-based search, the second cycle of the pilot will also include a full-text indexing engine.

The following outputs are made available for visualization or further processing:

Relevant news related to specific keywords, together with the corresponding Area of Interest

Detected changes

Moreover, the event detection workflow will be extended in order to automatically activate the change detection workflow. These changes are depicted in the updated architecture diagram in Figure 7.


8.2 Requirements

Table 13 lists the ingestion, storage, processing, and output requirements set by the second cycle of the pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 13 lists only the new requirements.

Table 13 Requirements of the Second SC7 Pilot

Requirement | Comment
R1 Monitor keyword-based text services (Twitter). Text is retrieved and stored together with provenance and any metadata provided by the service (notably location) | The NOMAD data connectors to Twitter and Reuters will be adapted to access the keyword search API of Twitter and store to Cassandra
R2 Regularly execute event detection using Spark over the most recent text batch | Event detection is part of the ingestion process and adds annotations to the text data; it is not part of the distributed processing
R3 Improve the speed of the change detection workflow | Optimize the scalability of the operators developed in Apache Spark for the change detection workflow
R4 Extend the change detection workflow to improve accuracy | Fundamental SNAP operators (e.g. Subset and Terrain Correction) for Sentinel-1 will be adapted to Apache Spark
R5 Areas of Interest are automatically defined by event detection | The Sentinel data connector is parametrized from the event detection module with a GIS shape
R6 End-user interface is based on Sextant | Improvement of Sextant functionalities to improve the user experience
R7 Users must be authenticated and authorized to access the pilot data | Sextant will be extended in order to support authentication and authorization
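To make R3 and R4 more concrete, the sketch below shows one plausible way to parallelize per-image SNAP preprocessing with PySpark by invoking SNAP's gpt command line tool for each downloaded Sentinel-1 product; the image paths, graph file name and cluster settings are assumptions, and the pilot's actual implementation of the operators in Spark may differ.

```python
# Minimal sketch: run a SNAP preprocessing graph (e.g. Subset + Terrain-Correction)
# over many Sentinel-1 products in parallel with PySpark. Paths and the graph file
# are illustrative assumptions; the pilot's native Spark operators may differ.
import subprocess

from pyspark import SparkContext

GRAPH = "/opt/graphs/subset_terrain_correction.xml"   # assumed SNAP graph definition

def preprocess(product_path):
    """Apply the SNAP graph to one Sentinel-1 product via the gpt CLI."""
    output = product_path.replace(".zip", "_preprocessed.dim")
    subprocess.run(["gpt", GRAPH, "-Pinput=" + product_path, "-Poutput=" + output],
                   check=True)
    return output

if __name__ == "__main__":
    sc = SparkContext(appName="sc7-change-detection-preprocessing")
    products = ["/data/sentinel1/S1A_scene_early.zip",
                "/data/sentinel1/S1A_scene_late.zip"]   # assumed local product paths
    results = sc.parallelize(products, numSlices=len(products)).map(preprocess).collect()
    print(results)
```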

Figure 7 Architecture of the Second SC7 Pilot

8.3 Architecture

To satisfy the requirements above, the following modules will be deployed:

Storage infrastructures

HDFS for storing satellite images

Cassandra for storing news and tweets content and metadata

Lucene for storing the GADM dataset, i.e. the administrative areas together with their geo-locations


Strabon for storing geo-locations of detected changes and location metadata about

news and tweets

Processing infrastructures

Spark will be made available for improving the change detection module and

developing the event detection module

Data integration

Semagrow will federate Strabon and Cassandra to provide the user interface with

homogeneous access to both data stores

Other modules

Twitter data connector

Reuters RSS feed reader

The Sentinel Data Aggregator receives as input the set of areas of interest and submits

a suitable query to the Sentinels Scientific Data Hub

Sextant as the user interface
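As a rough sketch of the Twitter data connector listed above (and of requirement R1), the code below polls the Twitter keyword search REST API and stores each tweet with its provenance metadata in Cassandra; the keyspace, table, keywords and authentication details are placeholders, and the pilot's actual connector (an adaptation of the NOMAD connectors) may be structured differently.

```python
# Minimal sketch: keyword-based Twitter ingestion into Cassandra (keyspace, table,
# keywords and bearer token are placeholder assumptions).
import requests
from cassandra.cluster import Cluster

SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"
KEYWORDS = ["flood", "wildfire"]                     # assumed pilot keywords
HEADERS = {"Authorization": "Bearer <token>"}        # placeholder credentials

session = Cluster(["cassandra"]).connect("sc7")      # assumed keyspace 'sc7'
insert = session.prepare(
    "INSERT INTO tweets (id, keyword, text, created_at, geo) VALUES (?, ?, ?, ?, ?)"
)

for keyword in KEYWORDS:
    response = requests.get(SEARCH_URL, headers=HEADERS,
                            params={"q": keyword, "count": 100}, timeout=30)
    for status in response.json().get("statuses", []):
        session.execute(insert, (
            status["id_str"],
            keyword,
            status["text"],
            status["created_at"],
            str(status.get("coordinates")),          # provenance / location metadata
        ))
```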

8.4 Deployment

Table 14 lists the components provided to the pilot as part of BDI16 and the components that will be developed within WP6 in the context of executing the pilot.

Table 14 Components needed to deploy the Second SC7 Pilot

Module | Task | Responsible
Big Data Integrator: HDFS/Hadoop, Cassandra, Spark, Semagrow, Strabon, SOLR | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, UoA, SWC
Cassandra and Strabon stores | The schema needs to be altered to support tweets by keyword | NCSR-D and UoA
Change detection module | Spark code to be developed for extending and improving the change detection algorithm | UoA
Event Detection module | Spark code to be developed to scale the event detection algorithm | NCSR-D
Twitter data connector | To be extended to access the keyword search Twitter API | NCSR-D
User interface | To be enhanced for the pilot | UoA

16 Cf. https://github.com/big-data-europe/README/wiki/Components


9 Conclusions

This report analysed the pilot requirements and specified the components of the generic Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting round. The relevant work in this task is to ensure that the components are within the scope of what is prepared in WP4, and that they interoperate and can be used in the same application.

All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure and provided to the piloting partners as a basis for their piloting applications, which will be developed in WP6. As a result of this preliminary testing and the interaction between the technical partners and the piloting partners, some of the original pilot descriptions have been refined and fully specified, and their usage of BDI components has been clarified. This ensures that the pilot descriptions are consistent with the first public release of the BDI platform (D4.2) and can be reproduced by interested third parties.

Work in this task (Task 5.2) will proceed as follows:

During the second pilot deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.

During the third pilot deployment phase, work in this task will prepare the next version of this document regarding the BDI instances needed for the third piloting round.



Page 12: Data Technologies · consulting about the platform. This task includes two phases: the design and the deployment phase. The design phase involves the following. Review the pilot descriptions

D54 ndash v 100

Page

12

Figure 1 Architecture of the Second SC1 Pilot

Figure 1 Architecture of the Second SC1 pilot

23 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

Distributed triple store for the data The second pilot cycle will also test the feasibility of

using SANSA stack1 as an alternative of SPARQL query processing

Processing infrastructures

Scientific Lenses query expansion

Other modules

Data connector including the data transformation modules for the alignment of data at

ingestion time

REST API for querying that builds a SPARQL query by using keywords to fill in pre-

defined query templates The querying services also uses Scientific Lenses to expand

queries

24 Deployment

Table 2 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

1 httpsansa-stacknet

D54 ndash v 100

Page

13

Table 2 Components needed to Deploy Second SC1 Pilot

Module Task Responsible

4store BDI dockers made available by WP4 NCSR-D

SANSA stack BDI dockers made available by WP4 FhGUniBonn

Data connector and

transformation modules

Develop a dynamic transformation

engine that uses SWAGGER

descriptions to select the appropriate

transformer

VU

Query endpoint Develop a dynamic query re-write

engine that uses SWAGGER

descriptions to select the transformer

VU

Scientific Lenses query

expansion module

Needs to be deployed and tested

unless an existing live service will be

used for the BDE pilot

VU

Table 2 Components needed to Deploy Second SC1 Pilot

D54 ndash v 100

Page

14

3 Second SC2 Pilot Deployment

31 Overview

The pilot is carried out by AK FAO and SWC in the frame of SC2 Food Security Sustainable

Agriculture and Forestry Marine Maritime and Inland Water Research and the Bioeconomy

The second pilot cycle builds upon the first pilot cycle (cf D51 Section 3) expanding the

relevant data sources and extending the data processing needed to handle a variety of data

types (apart from bibliographic data) relevant to Viticulture

The pilot demonstrates the following workflows:

1. Text mining workflow: automatically annotating scientific publications by (a) extracting named entities (locations, domain terms) and (b) extracting the captions of images, figures and tables. The extracted information is provided to viticultural researchers via a GUI that exposes search functionality.
2. Data processing workflow: the end users (viticultural researchers) upload scientific data in a variety of formats and provide the metadata needed in order to correctly interpret the data. The data is ingested and homogenized so that it can be compared and connected with other relevant data originally in diverse formats. The data is exposed to viticultural researchers via a GUI that exposes search/discovery, aggregation, analysis, correlation and visualization functionalities over structured data. The results of the data analysis will be stored in the infrastructure, to avoid carrying out the same processing multiple times, with appropriate provenance for future reference, publication and scientific replication.
3. Phenological modeling workflow, that is, the scheduling of agricultural operations (e.g. pruning, harvesting, etc.) by cross-examining the weather data observed in the area of the vineyard with the appropriate weather conditions needed for the aforementioned operations.
4. Variety identification workflow: the end users complete an on-spot questionnaire regarding the characteristics of a specific grape variety. Together with the geolocation of the questionnaire, this information is used to identify a grape variety.

The following datasets will be involved:

- The AGRIS and PubMed datasets, which include scientific publications.
- Weather data available via publicly available APIs, such as AccuWeather, OpenWeatherMap and Weather Underground.


- User-generated data, such as geotagged photos of leaves, young shoots and grape clusters, ampelographic data and SSR-marker data, which will be provided by the VITIS application.
- The OIV Descriptor List [2] for Grape Varieties and Vitis species.
- The Crop Ontology.

The following processing is carried out:

- Named entity extraction
- Researcher affiliation extraction and verification
- Variety identification
- Phenological modelling
- PDF structure processing to associate tables and diagrams with captions
- Data ingestion and homogenization
- Aggregation, analysis and correlation over scientific data

The following outputs are made available for visualization or further processing:

- Structured information and topics extracted from scientific publications
- Metadata for dataset searching and discovery
- Aggregation, analysis and correlation results

3.2 Requirements

Table 3 lists the ingestion, storage, processing and output requirements set by this pilot.

Table 3: Requirements of the Second SC2 Pilot

Requirement | Comment
R1: In case of failures during data processing, users should be able to recover data produced earlier in the workflow, along with their lineage and associated metadata | Processing modules should periodically store intermediate results and their lineage metadata. When starting up, processing modules should check at the metadata registry if intermediate results are available (see the sketch after this table).
R2: Extracting images and their captions from scientific publications | To be developed for the pilot, taking into account R1.
R3: Extracting thematic annotations from text in scientific publications | To be developed for the pilot, taking into account R1.
R4: Extracting researcher affiliations from the scientific publications | To be developed for the pilot, taking into account R1.
R5: Variety identification | To be developed for the pilot, taking into account R1.
R6: Phenological modeling | To be developed for the pilot, taking into account R1.
R7: Expose data and metadata in JSON through a Web API | The data ingestion module should write JSON documents in HDFS; 4store should be accessed via a SPARQL endpoint that responds with results in JSON.

[2] http://www.oiv.int/en
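The following is a minimal sketch of the checkpoint-and-recover behaviour asked for by R1: a processing step writes intermediate results together with lineage metadata, and on start-up checks a metadata registry before recomputing. The JSON-file registry, paths and step names are stand-ins for illustration only, not the pilot's actual metadata registry.

```python
# Sketch of R1: periodically store intermediate results plus lineage metadata,
# and on start-up check the registry for results that can be reused.
# The JSON-file registry and the paths are stand-ins, not the pilot's design.
import json, os, time

REGISTRY = "metadata_registry.json"

def record_intermediate(step, output_path, inputs):
    """Register an intermediate result with its lineage (inputs that produced it)."""
    registry = json.load(open(REGISTRY)) if os.path.exists(REGISTRY) else {}
    registry[step] = {
        "output": output_path,
        "lineage": {"inputs": inputs, "produced_at": time.time()},
    }
    with open(REGISTRY, "w") as f:
        json.dump(registry, f, indent=2)

def recover_intermediate(step):
    """Return a previously registered result for this step, or None."""
    if not os.path.exists(REGISTRY):
        return None
    entry = json.load(open(REGISTRY)).get(step)
    if entry and os.path.exists(entry["output"]):
        return entry
    return None

# Usage: skip re-extraction if a usable intermediate result already exists.
previous = recover_intermediate("caption-extraction")
if previous is None:
    # ... run the (expensive) extraction step, then checkpoint it ...
    record_intermediate("caption-extraction", "out/captions.json", ["pub-2017-01.pdf"])
```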


Figure 2: Architecture of the Second SC2 Pilot

3.3 Architecture

To satisfy the requirements above, the following modules will be deployed:

Storage infrastructures:
- HDFS for storing publication full-text and ingested datasets.
- A graph database for storing publication metadata (terms and named entities), affiliation metadata (connections between researchers), weather metadata and VITIS metadata.

Processing infrastructures:
- Metadata extraction: Spark or UnifiedViews [3] are used to extract RDF metadata from publication full-text. These tools will react on Kafka messages. Spark and UnifiedViews will be evaluated for this task.

[3] Cf. http://www.unifiedviews.eu


- PoolParty: a SKOS Thesaurus [4] will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g. bio terms, locations and other named entities). For this step the SWC PoolParty Semantic Suite [5] will be used. Additional enrichment of the dataset will be explored, e.g. via linking to DBpedia or other LOD sources.
- AKSTEM: the process of discovering relations and associations between organizations and people in the field of viticulture research.
- Phenological modeling: already developed in AK VITIS, it will be adapted to work in the context of an Apache Spark application.
- Variety identification: already developed in AK VITIS, it will be adapted to work in the context of an Apache Spark application.
- Extraction of images and figures and their captions from publication PDFs.
- Data analysis, which writes analysis results back into the infrastructure to be retrieved for visualization. Data analysis should accompany each write-back with appropriate metadata that specifies the processing lineage of the derived dataset. Intermediate results should also be written out (and described as such in the metadata) in order to allow resuming processing after a failure.

Other modules:
- Flume for publication ingestion: for every source that will be ingested into the system, there will be a Flume agent responsible for data ingestion and basic modification/unification.
- Kafka: as soon as a new record is available, a Kafka message will be produced. One Kafka consumer stores raw data into HDFS (see the sketch below).
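A minimal sketch of the Kafka consumer that stores raw records into HDFS could look as follows, assuming the kafka-python client and the WebHDFS-based hdfs package; the topic name, bootstrap servers, namenode URL and path layout are illustrative assumptions.

```python
# Sketch: a Kafka consumer that writes each raw record into HDFS.
# Topic name, bootstrap servers and the WebHDFS URL are assumptions.
from kafka import KafkaConsumer
from hdfs import InsecureClient

consumer = KafkaConsumer(
    "ingested-records",                      # hypothetical topic
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda v: v,          # keep the raw bytes
)
hdfs_client = InsecureClient("http://namenode:50070", user="bde")

for message in consumer:
    # One file per record, named after topic/partition/offset for traceability.
    path = "/raw/%s/%d/%d.json" % (message.topic, message.partition, message.offset)
    hdfs_client.write(path, data=message.value, overwrite=True)
```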

3.4 Deployment

Table 4 lists the components provided to the pilot as part of BDI and the components that will be developed within WP6 in the context of executing the pilot.

Table 4: Components needed to deploy the Second SC2 Pilot

Module | Task | Responsible
Spark over HDFS, Flume, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, SWC
GraphDB and/or Neo4j dockerization | To be investigated whether the Docker images provided by the official systems [6] are suitable for the pilot. If not, they will be altered for the pilot, or an already dockerized triple store such as Virtuoso or 4store will be used | SWC
Flume agents for publication ingestion and processing | To be developed for the pilot | SWC
Flume agents for data ingestion | To be extended for the pilot in order to support the introduced datasets (AccuWeather data, user-generated data) | SWC, AK
Data storage schema | To be developed for the pilot | SWC, AK
Phenological modelling | To be adapted from AK VITIS for the pilot | AK
Spark AKSTEM | To be adapted from AK STEM for the pilot | AK
Variety identification | To be adapted from AK VITIS for the pilot | AK

[4] Cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System
[5] Cf. http://www.poolparty.biz

[6] https://neo4j.com/developer/docker


4 Second SC3 Pilot Deployment

4.1 Overview

The pilot is carried out by CRES in the frame of SC3, Secure, Clean and Efficient Energy.

The second pilot cycle extends the first pilot by adding additional online and offline data analysis on raw data from Acoustic Emissions (AE) sensors and on aggregated data, such as parametrics from continuous monitoring systems (CMS). The pilot demonstrates the following workflow: a developer in the field of wind energy enhances condition monitoring for each unit in a wind farm by pooling together data from multiple units of the same farm (to consider the cluster operation in total) and third-party data (to perform correlated assessment). The custom analysis modules created by the developer use both raw data that is transferred offline to the processing cluster and condensed data streamed online, in the same time order in which the events occur.

The following datasets are involved:

- Raw sensor and SCADA data from a given wind farm.
- Online stream data comprised of parametrics and statistics extracted from the raw SCADA data.
- Raw sensor data from the Acoustic Emissions module of a given wind farm.

All data is in custom binary or ASCII formats. ASCII files contain a metadata header and, in tabulated form, the signal data (signals in columns, time sequence in rows). All data is annotated by location, time and system id.

The following processing is carried out:

- Near-real-time execution of parametrized models to return operational statistics and warnings, including correlation analysis of data across units
- Weekly execution of operational statistics
- Weekly execution of model parametrization
- Weekly specific acoustic emissions DSP

The following outputs are made available for visualization or further processing:

- Operational statistics, near-real-time and weekly
- Model parameters


4.2 Requirements

Table 5 lists the ingestion, storage, processing and output requirements set by this pilot. Since the second cycle of the pilot extends the first pilot, some requirements are identical and are therefore omitted from Table 5.

Table 5: Requirements of the Second SC3 Pilot

Requirement | Comment
R1: The online data will be sent (via OPC) from the intermediate (local) processing level to BDI | A data connector must be developed that provides for receiving OPC streams from an OPC-compatible server.
R2: The application should be able to recover from short outages by collecting the data transmitted during the outage from the data sources | An OPC data connector must be developed that can retrieve the missing data collected at the intermediate level from the distributed data historian systems.
R3: Near-real-time execution of parametrized models to return operational statistics, including correlation analysis of data across units | The analysis software should write its results back in a specified format and data model that is appropriate input for further analysis.
R4: The GUI supports database querying and data visualization for the analytics results | The GUI will be able to access files in that format and data model.


Figure 3: Architecture of the Second SC3 Pilot

4.3 Architecture

To satisfy the requirements above, the following modules will be deployed:

Storage infrastructures:
- HDFS, which stores binary blobs, each holding a temporal slice of the complete data. The slicing parameters are fixed and can be applied at data ingestion time.
- A Postgres relational database to store the warnings, the operational statistics and the output of the analysis. The schema will be defined at a later stage.
- A Kafka broker that will distribute the continuous stream of CMS data to model execution.

Processing infrastructures:


- A processor that operates upon temporal slices of data.
- A Spark module that orchestrates the application of the processor on slices.
- A Spark streaming module that operates on the online data.

Other modules:
- A data connector that offers an ingestion endpoint and/or can retrieve from remote data sources using the FTP protocol.
- A data connector that offers an ingestion endpoint that can retrieve an online stream using the OPC protocol and publish it to a Kafka topic (see the sketch below).
- Data visualization that can visualize the data files stored in HDFS.
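As an illustration of the OPC data connector, the sketch below polls values from an OPC UA server and publishes them to a Kafka topic. It assumes the python-opcua and kafka-python clients; the server URL, node id, topic name and polling interval are placeholders rather than the pilot's actual configuration.

```python
# Sketch of an OPC UA -> Kafka connector (polling variant).
# Server URL, node id, topic and polling interval are placeholder assumptions.
import json, time
from opcua import Client                # python-opcua
from kafka import KafkaProducer

opc = Client("opc.tcp://cms-gateway:4840")   # hypothetical OPC UA server
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

opc.connect()
try:
    node = opc.get_node("ns=2;s=Turbine1.Parametrics")   # hypothetical node id
    while True:
        value = node.get_value()
        producer.send("cms-parametrics",
                      {"node": str(node), "value": value, "ts": time.time()})
        time.sleep(1.0)   # condensed data arrives at a low rate
finally:
    opc.disconnect()
    producer.flush()
```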

4.4 Deployment

Table 6 lists the components provided to the pilot as part of BDI and the components that will be developed within WP6 in the context of executing the pilot.

Table 6: Components needed to deploy the Second SC3 Pilot

Module | Task | Responsible
Spark, HDFS, Postgres, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, SWC
Acoustic Emissions DSP | To be developed for the pilot | CRES
OPC data connector | To be developed for the pilot | CRES
Data visualization | To be extended for the pilot | CRES


5 Second SC4 Pilot Deployment

5.1 Use cases

The pilot is carried out by FhG and CERTH in the frame of SC4, Smart, Green and Integrated Transport.

The pilot demonstrates how to implement the workflow for ingesting, processing and storing stream and historical traffic data in a distributed environment. The pilot demonstrates the following workflows:

- The map matching of the Floating Car Data (FCD) stream that is generated by the taxi fleet. The FCD data, which represents the position of cabs using latitude and longitude coordinates, must be map matched to the roads on which the cabs are driving, in order to infer the traffic conditions of the roads. The map matching is done through an algorithm using a geographical database and topological rules.
- The monitoring of the current traffic conditions, which consumes the mapped FCD data and infers the traffic conditions of the roads.
- The forecasting of future traffic conditions, based on a model that is trained from historical and real-time mapped FCD data.

The second pilot is based upon the processing modules developed in the first pilot (cf. D5.2, Section 5), namely the processing modules developed by CERTH to analyze traffic data and classify traffic conditions. The second pilot will also develop the newly added workflow of traffic forecasting and model training, which did not exist during the first pilot cycle.

The data sources available to the pilot are:

- A near-real-time stream of Floating Car Data (FCD) generated by a fleet of 1200 taxis, containing information about the position, speed and direction of the cabs.
- A historical database of recorded FCD data.
- A geographical database with information about the road network in Thessaloniki.

The results of traffic monitoring and traffic forecasting are saved into a database for querying, statistics and visualizations.

5.2 Requirements

Table 7 lists the ingestion, storage, processing and output requirements set by this pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 7 lists only the new requirements.


Table 7: Requirements of the Second SC4 Pilot

Requirement | Comment
R1: The pilot will enable the evaluation of present and future traffic conditions (e.g. congestion) within temporal windows | The map-matched FCD data are used to determine the current traffic condition and to make predictions within different time windows.
R2: The traffic predictions will be saved in a database | Traffic conditions and predictions will be used for queries, statistics, evaluation of the quality of predictions, and visualizations.
R3: The pilot can be started in two configurations: single node (for development and testing) and cluster (production) | It must be possible to run all the pilot components in one single node for development and testing purposes. The cluster configuration must provide clustering of all components: messaging system (Kafka), processing modules (Flink, Spark, TensorFlow) and storage (Postgres).


Figure 4: Architecture of the Second SC4 Pilot

5.3 Architecture

The architecture of the pilot has been designed taking into consideration the data sources (mostly streams), the processing steps needed, and the information that needs to be computed. The pilot will ingest data from a near-real-time FCD data stream from cabs and from historical FCD data. The FCD data needs to be preprocessed for map matching before being used for classification/prediction.

Apache Kafka will be used to distribute the computations, as it provides a scalable, fault-tolerant messaging system. The processing of the data streams will be performed within temporal windows. Apache Flink will be used for the map matching algorithm, in the same manner as in the first cycle of the pilot. Apache Spark or TensorFlow will be considered as a platform to implement the traffic forecasting algorithm.

The algorithms used for map matching and classification will be provided using R, as it provides good support for machine learning algorithms and because it is commonly used


and well known by the researchers at CERTH. In order to use the R packages in a Flink application developed in Java, the pilot will connect to an R server (via Rserve). Recurrent Neural Networks will be used for the traffic forecasting module.

The traffic conditions and prediction computations will be stored in a scalable, fault-tolerant database such as Elasticsearch. The storage system must support spatial and temporal indexing (see the sketch below).
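A minimal sketch of an Elasticsearch index supporting the spatial and temporal indexing mentioned above is shown below; the index name and field names are illustrative assumptions, not the pilot's actual schema.

```python
# Sketch: create an Elasticsearch index with spatial (geo_point) and temporal
# (date) fields for traffic conditions. Index and field names are assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://elasticsearch:9200"])

es.indices.create(
    index="traffic-conditions",
    body={
        "mappings": {
            "properties": {
                "road_segment": {"type": "keyword"},
                "location":     {"type": "geo_point"},   # spatial index
                "observed_at":  {"type": "date"},        # temporal index
                "speed_kmh":    {"type": "float"},
                "congestion":   {"type": "keyword"},
            }
        }
    },
)

# Documents can then be filtered with geo_distance and range queries, e.g.
# es.search(index="traffic-conditions", body={"query": {"bool": {"filter": [
#     {"geo_distance": {"distance": "2km", "location": {"lat": 40.64, "lon": 22.94}}},
#     {"range": {"observed_at": {"gte": "now-15m"}}}]}}})
```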

5.4 Deployment

Table 8 lists the components provided to the pilot as part of BDI and the components that will be developed within WP6 in the context of executing the pilot.

Table 8: Components needed to deploy the Second SC4 Pilot

Module | Task | Responsible
PostGIS, Elasticsearch, Kafka, Flink, Spark, TensorFlow | BDI dockers made available by WP4 | NCSR-D, SWC, TF, FhG
A Kafka producer for the FCD data stream (source: URL) and historical data (source: file system) | Develop a Kafka producer to collect the FCD data as a stream from web services, and from the file system for the historical data sets, and send them to a Kafka topic | FhG
Kafka brokers | Install Kafka to provide a message broker and the topics | SWC
A Spark application for traffic forecasting and model training | Develop a Spark application that consumes map-matched FCD data from a Kafka topic. The application will train a prediction model and write the traffic predictions to Elasticsearch (see the sketch after this table) | FhG
A Kafka consumer for storing analysis results | Develop a Kafka consumer that stores the results of the traffic classification and prediction module | FhG
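To make the forecasting component in Table 8 more concrete, the following is a minimal sketch of a recurrent model for short-term speed forecasting, assuming TensorFlow/Keras; the window length, features, architecture and the toy training data are illustrative choices, not the pilot's final design.

```python
# Sketch: a small recurrent network that predicts the next average speed of a
# road segment from the previous 12 observations. Shapes and hyper-parameters
# are illustrative assumptions.
import numpy as np
import tensorflow as tf

WINDOW = 12   # e.g. 12 x 5-minute aggregates = one hour of history

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(WINDOW, 1)),
    tf.keras.layers.Dense(1),                     # predicted speed
])
model.compile(optimizer="adam", loss="mse")

# Toy training data standing in for historical map-matched FCD aggregates.
x_train = np.random.rand(1000, WINDOW, 1).astype("float32")
y_train = x_train[:, -1, 0] + 0.1 * np.random.rand(1000).astype("float32")

model.fit(x_train, y_train, epochs=2, batch_size=32, verbose=0)
next_speed = model.predict(x_train[:1])   # forecast for one segment/window
```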

6 Second SC5 Pilot Deployment

6.1 Use cases

The pilot is carried out by NCSR-D in the frame of SC5, Climate Action, Environment, Resource Efficiency and Raw Materials.

The pilot demonstrates the following workflow: a (potentially hazardous) substance is released in the atmosphere, which results in increased readings at one or more monitoring stations. The user accesses a user interface provided by the pilot to define the locations of the monitoring stations, as well as a timeseries of the measured values (e.g. gamma dose rate). The platform initiates:

- a weather matching algorithm, that is, a search for similarity between the current weather and the pre-computed weather patterns, as well as
- a dispersion matching algorithm, that is, a search for similarity of the current substance dispersion patterns with the precomputed ones.

The weather patterns have been extracted in a pre-processing step by clustering weather conditions recorded in the past, while the substance dispersion patterns have been precomputed by simulating different scenarios of substance release and weather conditions. The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request.

The following datasets are involved:

- NetCDF files from the European Centre for Medium-range Weather Forecasting (ECMWF [7]).
- GRIB files from the National Oceanic and Atmospheric Administration (NOAA [8]).

The following processing will be carried out:

- The weather clustering algorithm that creates clusters of similar weather conditions, implemented using the BDI platform (see Section 6.3).

[7] http://apps.ecmwf.int/datasets
[8] https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/global-forcast-system-gfs


- The WRF downscaling that takes as input low-resolution weather data and creates high-resolution weather data.
- The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model, which computes dispersion patterns given predominant weather conditions.

The following outputs are made available for visualization or further processing:

- The dispersions produced by DIPCOT.
- The weather clusters produced by the weather clustering algorithm.

6.2 Requirements

Table 9 lists the ingestion, storage, processing and output requirements set by this pilot.

Table 9: Requirements of the Second SC5 Pilot

Requirement | Comment
R1: Provide a means of downloading current/evaluation weather data from ECMWF or alternative services | A data connector/interface needs to be developed.
R2: ECMWF and NOAA datasets are compatible with the WRF and DIPCOT naming conventions | A preprocessing WPS normalization step will perform the necessary transformations and variable renamings needed to ensure compatibility (see the sketch after this table).
R3: Retrieve NetCDF files from HDFS as input to the weather clustering algorithm |
R4: Dispersion matching will filter on dispersion values | A relational database will provide indexes on dispersion values for efficient dispersion search.
R5: Dispersion visualization | Weather and dispersion matching must produce output compatible with Sextant's input, or Sextant must be modified to support the new input.
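The sketch below illustrates the kind of renaming that the WPS normalization step of R2 performs on a NetCDF file, using the netCDF4 library; the mapping between source and target variable names is a made-up example, not the actual ECMWF/NOAA-to-WRF/DIPCOT convention.

```python
# Sketch of the normalization step of R2: rename variables in a NetCDF file so
# that downstream tools find the names they expect. The name mapping below is
# a made-up example, not the real ECMWF/NOAA-to-WRF/DIPCOT table.
from netCDF4 import Dataset

RENAME = {
    "t2m": "T2",        # hypothetical: 2 m temperature
    "u10": "U10",       # hypothetical: 10 m wind, u component
    "v10": "V10",       # hypothetical: 10 m wind, v component
}

def normalize(path):
    with Dataset(path, "r+") as ds:
        for old_name, new_name in RENAME.items():
            if old_name in ds.variables and new_name not in ds.variables:
                ds.renameVariable(old_name, new_name)

normalize("ecmwf_forecast.nc")   # placeholder file name
```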


Figure 5: Architecture of the Second SC5 Pilot

6.3 Architecture

To satisfy the requirements described above, the following components will be deployed:

Storage infrastructure:
- HDFS for storing NetCDF and GRIB files.
- Postgres for storing dispersions.

Processing components:
- Scikit-learn or TensorFlow to host the weather clustering algorithm (see the sketch below).

Other modules:
- ECMWF and NOAA data connectors
- WPS normalization procedure
- WRF downscaling component
- DIPCOT atmospheric dispersion model
- Weather and dispersion matching
- Sextant for visualizing the dispersion layer
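A minimal sketch of the weather clustering component, assuming scikit-learn k-means over wind fields read from NetCDF files; the variable names, the number of clusters and the flattening of the fields are illustrative assumptions only.

```python
# Sketch: cluster past weather conditions into a small set of weather patterns.
# Variable names ("u10", "v10"), cluster count and preprocessing are assumptions.
import glob
import numpy as np
from netCDF4 import Dataset
from sklearn.cluster import KMeans

def load_field(path):
    """Flatten one weather snapshot (here: 10 m wind components) to a vector."""
    with Dataset(path) as ds:
        u = np.array(ds.variables["u10"][0])   # hypothetical variable names
        v = np.array(ds.variables["v10"][0])
    return np.concatenate([u.ravel(), v.ravel()])

samples = np.stack([load_field(f) for f in sorted(glob.glob("history/*.nc"))])

kmeans = KMeans(n_clusters=8, random_state=0).fit(samples)

# kmeans.cluster_centers_ are the pre-computed weather patterns;
# kmeans.predict(load_field("current.nc").reshape(1, -1)) matches the current
# weather against them (the weather matching step).
```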


6.4 Deployment

Table 10 lists the components provided to the pilot as part of BDI and the components that will be developed within WP6 in the context of executing the pilot.

Table 10: Components needed to deploy the Second SC5 Pilot

Module | Task | Responsible
HDFS, Sextant, Postgres | BDI dockers made available by WP4 | TF, UoA, NCSR-D
Scikit-learn / TensorFlow | To be developed in the pilot | NCSR-D
DIPCOT | To be packaged in the pilot | NCSR-D
Weather clustering algorithm | To be developed in the pilot | NCSR-D
Weather matching | To be developed in the pilot | NCSR-D
Dispersion matching | To be developed in the pilot | NCSR-D
ECMWF and NOAA data connector | To be developed in the pilot | NCSR-D
Data visualization UI | To be developed in the pilot | NCSR-D


7 Second SC6 Pilot Deployment

7.1 Use cases

The pilot is carried out by NCSR-D and SWC in the frame of SC6, Europe in a changing world: inclusive, innovative and reflective societies.

The pilot demonstrates the following workflow: municipality economic data (i.e. budget and budget execution data) are ingested on a regular basis (daily, weekly, and so on) from a series of locations, in a variety of structures and formats, and are homogenized so that they can be compared, analyzed and visualized in a comprehensible way. The data is exposed to users via a dashboard that exposes search/discovery, aggregation, analysis, correlation and visualization functionalities over structured data. The results of the data analysis will be stored in the infrastructure, to avoid carrying out the same processing multiple times.

The second cycle of the pilot will extend the first pilot by incorporating different formats, by developing a modular parsing library.

The following datasets are involved:

- Budget execution data of the Municipality of Athens
- Budget execution data of the Municipality of Thessaloniki
- Budget execution data of the Municipality of Barcelona

The current datasets involved are exposed either as an API or as CSV/XML files. Datasets will be described by DCAT-AP [9] metadata and the FIBO [10] and FIGI [11] ontologies. Statistical data will be described in the RDF Data Cube [12] vocabulary.

The following processing is carried out:

- Data ingestion and homogenization
- Aggregation, analysis and correlation over the budget data

The following outputs are made available for visualization or further processing:

- Structured information extracted from budget datasets, exposed as a SPARQL endpoint
- Metadata for dataset searching and discovery
- Aggregation and analysis results

[9] Cf. https://joinup.ec.europa.eu/asset/dcat_application_profile/description
[10] Cf. http://www.omg.org/spec/EDMC-FIBO/FND/1.0/Beta1/index.htm
[11] Cf. http://www.omg.org/hot-topics/finance.htm
[12] Cf. https://www.w3.org/TR/2014/REC-vocab-data-cube-20140116/

7.2 Requirements

Table 11 lists the ingestion, storage, processing and output requirements set by this pilot.

Table 11: Requirements of the Second SC6 Pilot

Requirement | Comment
R1: In case of failures during data processing, users should be able to recover data produced earlier in the workflow, along with their lineage and associated metadata | Processing modules should periodically store intermediate results. When starting up, processing modules should check at the metadata registry if intermediate results are available.
R2: Transform budget data into a homogenized format using various parsers | Parsers will be developed for the pilot, taking into account R1.
R3: Expose data and metadata through a SPARQL endpoint | The triple store should be accessed via a SPARQL endpoint.
R4: Intuitive, easy-to-use interface for searching and selecting relevant data sources. The use of the user interface should be documented, so that users can ease into using it with as little effort as possible | The GraphSearch UI will be used to create visualizations from SPARQL queries.


Figure 6: Architecture of the Second SC6 Pilot

7.3 Architecture

To satisfy the requirements above, the following modules will be deployed:

Storage infrastructures:
- HDFS for storing ingested datasets.
- 4store for storing homogenized statistical data and dataset metadata.

Processing infrastructures:
- Metadata extraction: Spark is used to extract RDF data and metadata from the budget data. These tools will react on Kafka messages.
- PoolParty: a SKOS Thesaurus [13] will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g. terms, locations and other named entities). For this step the SWC PoolParty Semantic Suite [14] will be used as an external service. PoolParty is accessible from the BDE components via an HTTP API; the connection between Spark and PoolParty has been implemented in the first pilot cycle. Additional enrichment of the dataset will be explored, e.g. via linking to DBpedia or other LOD sources.
- Data analysis, which will be performed on demand by pre-defined queries in the dashboard.

[13] Cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System
[14] Cf. http://www.poolparty.biz

Other modules:
- Flume for dataset ingestion: for every source that will be ingested into the system, there will be a Flume agent responsible for data ingestion and basic modification/unification.
- Kafka: as soon as a new record is available, a Kafka message will be produced. One Kafka consumer stores raw data into HDFS.
- A set of pre-defined SPARQL queries that carry out analytical aggregations, important comparisons and/or other analysis of the data (see the sketch below).
- A GUI that provides functionality for (a) metadata searching, to discover datasets, data and publications, and (b) linked data browsing (i.e. dereferencing entity descriptions in RDF), in the form of a visual dashboard realised in d3.js [15].
- GraphSearch as the user interface.
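As an example of the pre-defined SPARQL queries mentioned above, the sketch below sums budget-execution amounts per year over data modelled with the RDF Data Cube vocabulary; the endpoint URL and the dimension and measure property names are illustrative assumptions, not the pilot's actual schema.

```python
# Sketch: one pre-defined analytical SPARQL query, executed with SPARQLWrapper.
# The endpoint URL and the qb dimension/measure properties are assumptions.
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT ?year (SUM(?amount) AS ?total) WHERE {
  ?obs a qb:Observation ;
       <http://example.org/dimension/year> ?year ;        # hypothetical dimension
       <http://example.org/measure/amount> ?amount .      # hypothetical measure
}
GROUP BY ?year
ORDER BY ?year
"""

sparql = SPARQLWrapper("http://4store:8080/sparql/")      # hypothetical endpoint
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["year"]["value"], row["total"]["value"])
```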

7.4 Deployment

Table 12 lists the components provided to the pilot as part of BDI and the components that will be developed within WP6 in the context of executing the pilot.

Table 12: Components needed to deploy the Second SC6 Pilot

Module | Task | Responsible
Spark over HDFS, 4store, Flume, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, SWC
Data storage schema | To be extended for the pilot | SWC
Metadata extraction | Parsers for different data sources will be developed for the pilot | SWC
GraphSearch GUI | To be configured for the pilot | SWC

[15] Cf. https://d3js.org

8 Second SC7 Pilot Deployment

8.1 Use cases

The pilot is carried out by SatCen, UoA and NCSR-D in the frame of SC7, Secure societies: protecting freedom and security of Europe and its citizens.

The pilot demonstrates the following workflows:

1. Event detection workflow: news sites and social media are monitored and processed in order to extract and localize information about events. Events are categorized and the information from them is extracted; the end user is notified about the area concerned by the news and can visualize the event information together with the changes detected by the other workflow (if activated).
2. Change detection workflow: the end user selects a relevant Area of Interest. With respect to the selected dates, two satellite images (earliest and latest) of these areas are downloaded from the ESA Sentinels Scientific Data Hub and processed in order to detect changes. The end user is notified about detected changes and can view the images and event information about this area.

The second cycle of the SC7 pilot will extend the functionality and improve the performance of the first cycle of the pilot (cf. D5.2, Section 8).

Apart from the datasets used in the first cycle of the pilot, this cycle will also use the keyword-based Twitter API to retrieve tweets based on pre-defined keywords. To further support the keyword-based search, the second cycle of the pilot will also include a full-text indexing engine.

The following outputs are made available for visualization or further processing:

- Relevant news related to specific keywords, together with the corresponding Area of Interest
- Detected changes

Moreover, the event detection workflow will be extended in order to automatically activate the change detection workflow. These changes are depicted in the updated architecture diagram in Figure 7.


8.2 Requirements

Table 13 lists the ingestion, storage, processing and output requirements set by the second cycle of the pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 13 lists only the new requirements.

Table 13: Requirements of the Second SC7 Pilot

Requirement | Comment
R1: Monitor keyword-based text services (Twitter). Text is retrieved and stored together with provenance and any metadata provided by the service (notably location) | The NOMAD data connectors to Twitter and Reuters will be adapted to access the keyword search API of Twitter and store to Cassandra.
R2: Regularly execute event detection using Spark over the most recent text batch | Event detection is part of the ingestion process and adds annotations to the text data; it is not part of the distributed processing.
R3: Improve the speed of the change detection workflow | Optimize the scalability of the operators developed in Apache Spark for the change detection workflow.
R4: Extend the change detection workflow to improve accuracy | Fundamental SNAP operators (e.g. Subset and Terrain Correction) for Sentinel 1 will be adapted to Apache Spark.
R5: Areas of Interest are automatically defined by event detection | The Sentinel data connector is parametrized from the event detection module with a GIS shape.
R6: The end-user interface is based on Sextant | Improvement of Sextant functionalities to improve the user experience.
R7: Users must be authenticated and authorized to access the pilot data | Sextant will be extended in order to support authentication and authorization.

Figure 7: Architecture of the Second SC7 Pilot

8.3 Architecture

To satisfy the requirements above, the following modules will be deployed:

Storage infrastructures:
- HDFS for storing satellite images.
- Cassandra for storing news and tweets (content and metadata).
- Lucene for storing the GADM dataset, i.e. the administrative areas together with their geo-locations.


- Strabon for storing geo-locations of detected changes and location metadata about news and tweets.

Processing infrastructures:
- Spark will be made available for improving the change detection module and developing the event detection module.

Data integration:
- Semagrow will federate Strabon and Cassandra to provide the user interface with homogeneous access to both data stores.

Other modules:
- Twitter data connector (see the sketch below)
- Reuters RSS feed reader
- The Sentinel Data Aggregator, which receives as input the set of Areas of Interest and submits a suitable query to the Sentinels Scientific Data Hub
- Sextant as the user interface
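The sketch below illustrates the keyword-based Twitter connector feeding Cassandra, assuming a tweepy v3-style client and the DataStax cassandra-driver; the credentials, keywords, keyspace and table layout are placeholders, and the pilot itself adapts the existing NOMAD connectors rather than this code.

```python
# Sketch: retrieve tweets for pre-defined keywords and store them in Cassandra.
# Credentials, keyspace/table names and the tweepy call are placeholder
# assumptions; the pilot adapts the existing NOMAD connectors instead.
import tweepy
from cassandra.cluster import Cluster

KEYWORDS = ["flood", "earthquake"]          # hypothetical pre-defined keywords

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

session = Cluster(["cassandra"]).connect("sc7")   # hypothetical keyspace
insert = session.prepare(
    "INSERT INTO tweets (id, keyword, created_at, text, geo) VALUES (?, ?, ?, ?, ?)"
)

for keyword in KEYWORDS:
    for tweet in api.search(q=keyword, count=100):    # tweepy v3-style search
        geo = str(tweet.coordinates) if tweet.coordinates else None
        session.execute(insert, (tweet.id, keyword, tweet.created_at,
                                 tweet.text, geo))
```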

8.4 Deployment

Table 14 lists the components provided to the pilot as part of BDI [16] and the components that will be developed within WP6 in the context of executing the pilot.

Table 14: Components needed to deploy the Second SC7 Pilot

Module | Task | Responsible
Big Data Integrator: HDFS/Hadoop, Cassandra, Spark, Semagrow, Strabon, SOLR | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, UoA, SWC
Cassandra and Strabon stores | The schema needs to be altered to support tweets retrieved by keyword | NCSR-D and UoA
Change detection module | Spark code to be developed for extending and improving the change detection algorithm | UoA
Event detection module | Spark code to be developed to scale the event detection algorithm | NCSR-D
Twitter data connector | To be extended to access the keyword search Twitter API | NCSR-D
User interface | To be enhanced for the pilot | UoA

[16] Cf. https://github.com/big-data-europe/README/wiki/Components


9 Conclusions

This report analysed the pilot requirements and specified the components of the generic Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting round. The relevant work in this task is to ensure that the components are within the scope of what is prepared in WP4, and that they interoperate and can be used in the same application.

All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure and provided to the piloting partners as a basis for their piloting applications, which will be developed in WP6. As a result of this preliminary testing and of the interaction between the technical partners and the piloting partners, some of the original pilot descriptions have been refined and fully specified, and their usage of BDI components has been clarified. This ensures that the pilot descriptions are consistent with the first public release of the BDI platform (D4.2) and can be reproduced by interested third parties.

Work in this task (Task 5.2) will proceed as follows:

- During the second pilot deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.
- During the third pilot deployment phase, work in this task will prepare the next version of this document, covering the BDI instances needed for the third piloting round.

Page 13: Data Technologies · consulting about the platform. This task includes two phases: the design and the deployment phase. The design phase involves the following. Review the pilot descriptions

D54 ndash v 100

Page

13

Table 2 Components needed to Deploy Second SC1 Pilot

Module Task Responsible

4store BDI dockers made available by WP4 NCSR-D

SANSA stack BDI dockers made available by WP4 FhGUniBonn

Data connector and

transformation modules

Develop a dynamic transformation

engine that uses SWAGGER

descriptions to select the appropriate

transformer

VU

Query endpoint Develop a dynamic query re-write

engine that uses SWAGGER

descriptions to select the transformer

VU

Scientific Lenses query

expansion module

Needs to be deployed and tested

unless an existing live service will be

used for the BDE pilot

VU

Table 2 Components needed to Deploy Second SC1 Pilot

D54 ndash v 100

Page

14

3 Second SC2 Pilot Deployment

31 Overview

The pilot is carried out by AK FAO and SWC in the frame of SC2 Food Security Sustainable

Agriculture and Forestry Marine Maritime and Inland Water Research and the Bioeconomy

The second pilot cycle builds upon the first pilot cycle (cf D51 Section 3) expanding the

relevant data sources and extending the data processing needed to handle a variety of data

types (apart from bibliographic data) relevant to Viticulture

The pilot demonstrates the following workflows

1 Text mining workflow Automatically annotating scientific publications by (a) extracting

named entities (locations domain terms) and (b) extracting the captions of images

figures and tables The extracted information is provided to viticultural researchers via

a GUI that exposes search functionality

2 Data processing workflow The end users (viticultural researchers) upload scientific

data in a variety of formats and provide the metadata needed in order to correctly

interpret the data The data is ingested and homogenized so that it can be compared

and connected with other relevant data originally in diverse formats The data is

exposed to viticultural researchers via a GUI that exposes searchdiscovery

aggregation analysis correlation and visualization functionalities over structured data

The results of the data analysis will be stored in the infrastructure to avoid carrying out

the same processing multiple times with appropriate provence for future reference

publication and scientific replication

3 Phenologic modeling workflow that is the scheduling of agricultural operations (eg

pruning harvesting etc) by cross-examining the weather data observed in the area of

the vineyard with the appropriate weather conditions needed for the aforementioned

operations

4 Variety identification workflow The end users complete an on-spot questionnaire

regarding the characteristics of a specific grape variety Together with the geolocation

of the questionnaire this information is used to identify a grape variety

The following datasets will be involved

The AGRIS and PubMed datasets that include scientific publications

Weather data available via publicly-available API such as AccuWeather

OpenWeatherMap Weather Underground

D54 ndash v 100

Page

15

User-generated data such as geotagged photos from leaves young shoots and grape

clusters ampelographic data SSR-marker data that will be provided by the VITIS

application

OIV Descriptor List2 for Grape Varieties and Vitis species

Crop Ontology

The following processing is carried out

Named entity extraction

Researcher affiliation extraction and verification

Variety identification

Phenologic modelling

PDF structure processing to associate tables and diagrams with captions

Data ingestion and homogenization

Aggregation analysis correlation over scientific data

The following outputs are made available for visualization or further processing

Structured information topics extracted from scientific publications

Metadata for dataset searching and discovery

Aggregation analysis correlation results

32 Requirements

Table 3 lists the ingestion storage processing and output requirements set by this pilot

Table 3 Requirements of the Second SC2 Pilot

Requirement Comment

R1 In case of failures during data

processing users should be able to

recover data produced earlier in the

workflow along with their lineage and

associated metadata

Processing modules should periodically

store intermediate results and their lineage

metadata When starting up processing

modules should check at the metadata

registry if intermediate results are available

R2 Extracting images and their captions

from scientific publications

To be developed for the pilot taking into

account R1

2 httpwwwoivinten

D54 ndash v 100

Page

16

R3 Extracting thematic annotations from

text in scientific publications

To be developed for the pilot taking into

account R1

R4 Extracting researcher affiliations from

the scientific publications

To be developed for the pilot taking into

account R1

R5 Variety identification To be developed for the pilot taking into

account R1

R6 Phenolic modeling To be developed for the pilot taking into

account R1

R5 Expose data and metadata in JSON

through a Web API

Data ingestion module should write JSON

documents in HDFS 4store should be

accessed via a SPARQL endpoint that

responds with results in JSON

Table 3 Requirements of the Second SC2 Pilot

D54 ndash v 100

Page

17

Figure 2 Architecture of the Second SC2 Pilot

Figure 2 Architecture of the Second SC2 Pilot

33 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing publication full-text and ingested datasets

A graph database for storing publication metadata (terms and named entities)

affiliation metadata (connections between researchers) weather metadata and VITIS

metadata

Processing infrastructures

Metadata extraction Spark or UnifiedViews3 are used to extract RDF metadata from

publication full-text These tools will react on Kafka messages Spark and UnifiedViews

will be evaluated for this task

3 Cf httpwwwunifiedviewseu

D54 ndash v 100

Page

18

PoolParty A SKOS Thesaurus4 will be used to consolidatetranslate (linkmap) the

terms in the ingested documents (eg bio terms locations and other named entities)

For this step the SWC PoolParty Semantic Suite5 will be used Additional enrichment

of the dataset will be explored eg via linking to DBpedia or other LOD sources

AKSTEM the process of discovering relations and associations between organizations

and people in the field of viticulture research

Phenolic Modeling algorithm already developed in AK VITIS will be adapted to work in

the context of an Apache Spark application

Variety Identification already developed in AK VITIS will be adapted to work in the

context of an Apache Spark application

Extraction of images and figures and their captions from publication PDFs

Data analysis which writes analysis results back into the infrastructure to be retrieved

for visualization Data analysis should accompany each write-back with appropriate

metadata that specify the processing lineage of the derived dataset Intermediate

results should also be written out (and described as such in the metadata) in order to

allow resuming processing after a failure

Other modules

Flume for publication ingestion For every source that will be ingested into the system

there will be a flume agent responsible for data ingestion and basic

modificationunification

Kafka as soon as a new record is available a Kafka message will be produced One

kafka consumer stores raw data into HDFS

34 Deployment

Table 4 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 4 Components needed to deploy the Second SC2 Pilot

Module Task Responsible

Spark over HDFS Flume

Kafka

BDI dockers made available by WP4 FH TF InfAI

SWC

4 Cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 5 Cf httpwwwpoolpartybiz

D54 ndash v 100

Page

19

GraphDB andor Neo4j

dockerization

To be investigated if the Docker

images provided by the official

systems6 are suitable for the pilot If

not will be altered for the pilot or use

an already dockerized triple store such

as Virtuoso or 4store

SWC

Flume agents for publication

ingestion and processing

To be developed for the pilot SWC

Flume agents for data

ingestion

To be extended for the pilot in order to

support the introduced datasets

(accuweather data user-generated

data)

SWC AK

Data storage schema To be developed for the pilot SWC AK

Phenolic modelling To be adapted from AK VITIS for the

pilot

AK

Spark AKSTEM To be adapted from AK STEM for the

pilot

AK

Variety Identification To be adapted from AK VITIS for the

pilot

AK

Table 4 Components needed to deploy the Second SC2 Pilot

6 httpsneo4jcomdeveloperdocker

D54 ndash v 100

Page

20

4 Second SC3 Pilot Deployment

41 Overview

The pilot is carried out by CRES in the frame of SC3 Secure Clean and Efficient Energy

The second pilot cycle extends the first pilot by adding additional online and offline data

analysis on raw data regarding Acoustic Emissions (AE) sensors and aggregated data such

as parametrics from continuous monitoring systems (CMS) The pilot demonstrates the

following workflow a developer in the field of wind energy enhances condition monitoring for

each unit in a wind farm by pooling together data from multiple units from the same farm (to

consider the cluster operation in total) and third party data (to perform correlated assessment)

The custom analysis modules created by the developer use both raw data that are transferred

offline to the processing cluster and condensed data streamed online at the same time order

that the event occurs

The following datasets are involved

Raw sensor and SCADA data from a given wind farm

Online stream data comprised of parametrics and statistics extracted from the raw

SCADA data

Raw sensor data from Acoustic Emissions module from a given wind farm

All data is in custom binary or ASCII formats ASCII files contain a metadata header and in

tabulated form the signal data (signal in columns time sequence in rows) All data is annotated

by location time and system id

The following processing is carried out

Near-real time execution of parametrized models to return operational statistics

warnings including correlation analysis of data across units

Weekly execution of operational statistics

Weekly execution of model parametrization

Weekly specific acoustic emissions DSP

The following outputs are made available for visualization or further processing

Operational statistics near-real time and weekly

Model parameters

D54 ndash v 100

Page

21

42 Requirements

Table 5 lists the ingestion storage processing and output requirements set by this pilot Since

the second cycle of the pilot extends the first pilot some requirements are identical and

therefore omitted from Table 5

Table 5 Requirements of Second SC3 Pilot

Requirement Comment

R1 The online data will be sent (via

OPC) from the intermediate

(local) processing level to BDI

A data connector must be developed that provides

for receiving OPC streams from an OPC-

compatible server

R2 The application should be able

to recover from short outages by

collecting the data transmitted

during the outage from the data

sources

An OPC data connector must be developed that

can retrieve the missing data collected at the

intermediate level from the distributed data

historian systems

R3 Near-realtime execution of

parametrized models to return

operational statistics including

correlation analysis of data

across units

The analysis software should write its results back

into a specified format and data model that is

appropriate input for further analysis

R4 The GUI supports database

querying and data visualization

for the analytics results

The GUI will be able to access files in the format

and data model

Table 5 Requirements of the Second SC3 Pilot

D54 ndash v 100

Page

22

Figure 3 Architecture of the Second SC3 Pilot

Figure 3 Architecture of the Second SC3 Pilot

43 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS that stores binary blobs each holding a temporal slice of the complete data The

slicing parameters are fixed and can be applied at data ingestion time

A Postgres relational database to store the warnings operational statistics and the

output of the analysis The schema will be defined in a later

A Kafka broker that will distribute the continuous stream of CMS to model execution

Processing infrastructures

D54 ndash v 100

Page

23

A processor that operates upon temporal slices of data

A Spark module that orchestrates the application of the processor on slices

A Spark streaming module that operates on the online data

Other modules

A data connector that offers an ingestion endpoint andor can retrieve from remote data

sources using the FTP protocol

A data connector that offers an ingestion endpoint that can retrieve an online stream

using OPC protocol and publish it to a Kafka topic

Data visualization that can visualize the data files stored in HDFS

44 Deployment

Table 6 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 6 Components needed to deploy the Second SC3 Pilot

Module Task Responsible

Spark HDFS Postgres

Kafka

BDI dockers made available by WP4 FH TF InfAI

NCSR-D SWC

Acoustic Emissions DSP To be developed for the pilot CRES

OPC Data connector To be developed for the pilot CRES

Data visualization To be extended for the pilot CRES

Table 6 Components needed to deploy the Second SC3 Pilot

D54 ndash v 100

Page

24

5 Second SC4 Pilot Deployment

51 Use cases

The pilot is carried out by FhG and CERTH in the frame of SC4 Smart Green and Integrated

Transport

The pilot demonstrates how to implement the workflow for ingesting processing and storing

stream and historical traffic data in a distributed environment The pilot demonstrates the

following workflows

The map matching of the Floating Car Data (FCD) stream that is generated by the taxi

fleet The FCD data that represents the position of cabs using latitude and longitude

coordinates must be map matched to the roads on which the cabs are driving in order

to infer the traffic conditions of the roads The map matching is done through an

algorithm using a geographical database and topological rules

The monitoring of the current traffic conditions that consumes the mapped FCD data

and infers the traffic conditions of the roads

The forecasting of future traffic conditions based on a model that is trained from

historical and real-time mapped FCD data

The second pilot is based upon the processing modules developed in the first pilot (cf D52

Section 5) namely the processing modules developed by CERTH to analyze traffic data

classify traffic conditions The second pilot will also develop the newly added workflow of the

traffic forecasting and model training that did not exist during the first pilot cycle

The data sources available for the pilot are

A near-real time stream Floating Car Data (FCD) generated by a fleet of 1200 taxis

containing information about the position speed and direction of the cabs

A historical database of recorded FCD data

A geographical database with information about the road network in Thessaloniki

The results of traffic monitoring and traffic forecasting are saved into a database for querying

statistics and visualizations

52 Requirements

Table 7 lists the ingestion storage processing and output requirements set by this pilot Since

the present pilot cycle is an extension of the first pilot the requirements of the first pilot also

apply Table 13 lists only the new requirements

D54 ndash v 100

Page

25

Table 7 Requirements of the Second SC4 Pilot

Requirement Comment

R1 The pilot will enable the

evaluation of the present and

future traffic conditions (eg

congestion) within temporal

windows

The FCD map matched data are used to determine

the current traffic condition and to make predictions

within different time windows

R2 The traffic predictions will be

saved in a database

Traffic condition and prediction will be used for

queries statistics evaluation of the quality of

predictions visualizations

R3 The pilot can be started in two

configurations single node (for

development and testing) and

cluster (production)

It must be possible to run all the pilot components

in one single node for development and testing

purposes The cluster configuration must provide

cluster of any components messaging system

(Kafka) processing modules (Flink Spark

TensorFlow) storage (Postgres)

Table 7 Requirements of the Second SC4 Pilot

D54 ndash v 100

Page

26

Figure 4 Architecture of the Second SC4 Pilot

Figure 4 Architecture of the Second SC4 Pilot

53 Architecture

The architecture of the pilot has been designed taking into consideration the data sources

mostly streams the processing steps needed and the information that needs to be computed

The pilot will ingest data from a near real-time FCD data stream from cabs and from historical

FCD data The FCD data needs to be preprocessed for map matching before being used for

classificationprediction

Apache Kafka will be used to distribute the computations as it provides a scalable fault

tolerant messaging system The processing of the data streams will be performed within

temporal windows Apache Flink will be used for the map matching algorithm in the same

manner as in the first cycle of the pilot Apache Spark or Tensorflow will be considered as a

platform to implement the traffic forecasting algorithm

The algorithms used for the map matching and classification will be provided using R as

it provides a good support for machine learning algorithms and because it is commonly used

D54 ndash v 100

Page

27

and well known by researchers at CERTH In order to use the R packages in a Flink application

developed in Java the pilot will connect to R server (via Rserve) Recurrent Neural Networks

will be used for the traffic forecasting module

The traffic conditions and prediction computation will be stored in a scalable fault tolerant

database such as Elasticsearch The storage system must support spatial and temporal

indexing

54 Deployment

Table 8 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 8 Components needed to deploy Second SC4 Pilot

Module Task Responsible

PostGIS Elasticsearch

Kafka Flink Spark

TensorFlow

BDI dockers made available by WP4 NCSR-D SWC

TF FhG

A Kafka producer for FCD

data stream (source URL)

and historical data (source

file system)

Develop a Kafka producer to collect

the FCD data as a stream from web

services and from the file system for

the historical data sets and send them

to a Kafka topic

FhG

Kafka brokers Install Kafka to provide a message

broker and the topics

SWC

A Spark application for traffic

forecasting and model

training

Develop a Spark application that

consumes FCD matched data from a

Kafka topic The application will train a

prediction model and write the traffic

predictions to ElasticSearch

FhG

D54 ndash v 100

Page

28

A Kafka consumer for storing

analysis results

Develop a Kafka consumer that stores

the result of the Traffic Classification

and prediction module

FhG

Table 8 Components needed to deploy the Second SC4 Pilot

6 Second SC5 Pilot Deployment

61 Use cases

The pilot is carried out by NCSR-D in the frame of SC5 Climate Action Environment Resource

Efficiency and Raw Materials

The pilot demonstrates the following workflow A (potentially hazardous) substance is released

in the atmosphere that results to increased readings in one or more monitoring stations The

user accesses a user interface provided by the pilot to define the locations of the monitoring

stations as well as a timeseries of the measured values (eg gamma dose rate) The platform

initiates

a weather matching algorithm that is a search for similarity of the current weather and

the pre-computed weather patterns as well as

a dispersion matching algorithm that is a search for similarity of the current substance

dispersion patterns with the precomputed ones

The weather patterns have been extracted in a pre-processing step by clustering weather

conditions recorded in the past while the substance dispersion patterns have been

precomputed by simulating different scenarios of substance release and weather conditions

The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request

The following datasets are involved:

- NetCDF files from the European Centre for Medium-Range Weather Forecasts (ECMWF7)
- GRIB files from the National Oceanic and Atmospheric Administration (NOAA8)

The following processing will be carried out:

- The weather clustering algorithm that creates clusters of similar weather conditions, implemented using the BDI platform (see Section 6.3)
- The WRF downscaling that takes as input a low-resolution weather field and creates a high-resolution one
- The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model, which computes dispersion patterns given predominant weather conditions

The following outputs are made available for visualization or further processing:

- The dispersions produced by DIPCOT
- The weather clusters produced by the weather clustering algorithm

7 http://apps.ecmwf.int/datasets
8 https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/global-forcast-system-gfs

6.2 Requirements

Table 9 lists the ingestion, storage, processing and output requirements set by this pilot.

Table 9: Requirements of the Second SC5 Pilot

R1: Provide a means of downloading current/evaluation weather data from ECMWF or alternative services.
Comment: A data connector/interface needs to be developed.

R2: ECMWF and NOAA datasets are compatible with the WRF and DIPCOT naming conventions.
Comment: A preprocessing WPS normalization step will perform the transformations and variable renamings needed to ensure compatibility.

R3: Retrieve NetCDF files from HDFS as input to the weather clustering algorithm.

R4: Dispersion matching will filter on dispersion values.
Comment: A relational database will provide indexes on dispersion values for efficient dispersion search.

R5: Dispersion visualization.
Comment: Weather and dispersion matching must produce output compatible with Sextant's input, or Sextant must be modified to support the new input.
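The variable renaming implied by R2 could, for instance, be sketched with the netCDF4 Python library as shown below. The rename mapping is a placeholder, since the actual naming conventions are defined by WRF, DIPCOT and the pilot's WPS normalization step.

    # Sketch of a normalization step that renames variables in a NetCDF file
    # to the names expected downstream (e.g. by WRF/DIPCOT).
    # The mapping below is a placeholder, not the pilot's actual convention.
    from netCDF4 import Dataset

    RENAME_MAP = {
        "t2m": "T2",   # 2 m temperature
        "u10": "U10",  # 10 m wind, u component
        "v10": "V10",  # 10 m wind, v component
    }

    with Dataset("ecmwf_forecast.nc", mode="a") as nc:
        for old_name, new_name in RENAME_MAP.items():
            if old_name in nc.variables:
                nc.renameVariable(old_name, new_name)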


Figure 5: Architecture of the Second SC5 Pilot

6.3 Architecture

To satisfy the requirements described above, the following components will be deployed.

Storage infrastructure:
- HDFS for storing the NetCDF and GRIB files
- Postgres for storing dispersions

Processing components:
- scikit-learn or TensorFlow to host the weather clustering algorithm

Other modules:
- ECMWF and NOAA data connectors
- WPS normalization procedure
- WRF downscaling component
- DIPCOT atmospheric dispersion model
- Weather and dispersion matching
- Sextant for visualizing the dispersion layer
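As a rough sketch of the weather clustering component, assuming the NetCDF files have already been copied from HDFS to local storage, the clustering could be prototyped with scikit-learn as follows; the variable names and the number of clusters are illustrative assumptions.

    # Rough sketch of weather clustering with scikit-learn KMeans.
    # Assumes the NetCDF files were already fetched from HDFS to local storage;
    # variable names (u10, v10, t2m) and n_clusters=8 are illustrative only.
    import glob

    import numpy as np
    from netCDF4 import Dataset
    from sklearn.cluster import KMeans

    samples = []
    for path in glob.glob("weather/*.nc"):
        with Dataset(path) as nc:
            u = np.asarray(nc.variables["u10"][:])
            v = np.asarray(nc.variables["v10"][:])
            t = np.asarray(nc.variables["t2m"][:])
            # One feature vector per time step, built from the flattened fields.
            for i in range(u.shape[0]):
                samples.append(np.concatenate([u[i].ravel(), v[i].ravel(), t[i].ravel()]))

    X = np.vstack(samples)
    kmeans = KMeans(n_clusters=8, random_state=0).fit(X)

    # Each cluster centre corresponds to one pre-computed weather pattern;
    # weather matching later assigns the current weather to the nearest centre.
    print(kmeans.cluster_centers_.shape)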


6.4 Deployment

Table 10 lists the components provided to the pilot as part of the BDI and the components that will be developed within WP6 in the context of executing the pilot.

Table 10: Components needed to deploy the Second SC5 Pilot

Module: HDFS, Sextant, Postgres
Task: BDI dockers made available by WP4
Responsible: TF, UoA, NCSR-D

Module: Scikit-learn, TensorFlow
Task: To be developed in the pilot
Responsible: NCSR-D

Module: DIPCOT
Task: To be packaged in the pilot
Responsible: NCSR-D

Module: Weather clustering algorithm
Task: To be developed in the pilot
Responsible: NCSR-D

Module: Weather matching
Task: To be developed in the pilot
Responsible: NCSR-D

Module: Dispersion matching
Task: To be developed in the pilot
Responsible: NCSR-D

Module: ECMWF and NOAA data connector
Task: To be developed in the pilot
Responsible: NCSR-D

Module: Data visualization UI
Task: To be developed in the pilot
Responsible: NCSR-D


7 Second SC6 Pilot Deployment

7.1 Use cases

The pilot is carried out by NCSR-D and SWC in the frame of SC6: Europe in a changing world - inclusive, innovative and reflective societies.

The pilot demonstrates the following workflow. Municipality economic data (i.e. budget and budget execution data) are ingested on a regular basis (daily, weekly, and so on) from a series of locations in a variety of structures and formats, and are homogenized so that they can be compared, analyzed and visualized in a comprehensible way. The data is exposed to users via a dashboard that offers search/discovery, aggregation, analysis, correlation and visualization functionalities over structured data. The results of the data analysis will be stored in the infrastructure to avoid carrying out the same processing multiple times.

The second cycle of the pilot will extend the first pilot by incorporating different formats, by developing a modular parsing library.

The following datasets are involved:

- Budget execution data of the Municipality of Athens
- Budget execution data of the Municipality of Thessaloniki
- Budget execution data of the Municipality of Barcelona

The current datasets are exposed either as an API or as CSV/XML files. Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies. Statistical data will be described using the RDF Data Cube12 vocabulary.

The following processing is carried out:

- Data ingestion and homogenization
- Aggregation, analysis and correlation over scientific data

The following outputs are made available for visualization or further processing:

- Structured information extracted from budget datasets, exposed as a SPARQL endpoint
- Metadata for dataset searching and discovery
- Aggregation and analysis results

9 Cf. https://joinup.ec.europa.eu/asset/dcat_application_profile/description
10 Cf. http://www.omg.org/spec/EDMC-FIBO/FND/1.0/Beta1/index.htm
11 Cf. http://www.omg.org/hot-topics/finance.htm
12 Cf. https://www.w3.org/TR/2014/REC-vocab-data-cube-20140116/

7.2 Requirements

Table 11 lists the ingestion, storage, processing and output requirements set by this pilot.

Table 11: Requirements of the Second SC6 Pilot

R1: In case of failures during data processing, users should be able to recover data produced earlier in the workflow, along with their lineage and associated metadata.
Comment: Processing modules should periodically store intermediate results. When starting up, processing modules should check at the metadata registry whether intermediate results are available.

R2: Transform budget data into a homogenized format using various parsers.
Comment: Parsers will be developed for the pilot, taking into account R1.

R3: Expose data and metadata through a SPARQL endpoint.
Comment: The triple store should be accessed via a SPARQL endpoint.

R4: Intuitive, easy-to-use interface for searching and selecting relevant data sources. The use of the user interface should be documented so that users can ease into using it with as little effort as possible.
Comment: The GraphSearch UI will be used to create visualizations from SPARQL queries.


Figure 6: Architecture of the Second SC6 Pilot

7.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures:
- HDFS for storing ingested datasets
- 4store for storing homogenized statistical data and dataset metadata

Processing infrastructures:
- Metadata extraction: Spark is used to extract RDF data and metadata from budget data. These tools will react on Kafka messages.
- PoolParty: a SKOS thesaurus13 will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g. bio terms, locations and other named entities). For this step the SWC PoolParty Semantic Suite14 will be used as an external service. PoolParty is accessible from the BDE components via an HTTP API; the connection between Spark and PoolParty has been implemented in the first pilot cycle. Additional enrichment of the dataset will be explored, e.g. via linking to DBpedia or other LOD sources.
- Data analysis that will be performed on demand by pre-defined queries in the dashboard.

Other modules:
- Flume for dataset ingestion: for every source that will be ingested into the system there will be a Flume agent responsible for data ingestion and basic modification/unification.
- Kafka: as soon as a new record is available, a Kafka message will be produced. One Kafka consumer stores the raw data into HDFS.
- A set of pre-defined SPARQL queries that carry out analytical aggregations, important comparisons and/or other analysis of the data (see the sketch below).
- A GUI that provides functionality for (a) metadata searching to discover datasets, data and publications, and (b) linked data browsing (i.e. dereferencing entity descriptions in RDF), in the form of a visual dashboard realised in d3.js15.
- GraphSearch as the user interface.

13 Cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System
14 Cf. http://www.poolparty.biz
15 Cf. https://d3js.org
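One of the pre-defined SPARQL queries mentioned above could look like the sketch below, which uses SPARQLWrapper to aggregate executed budget amounts per municipality. The endpoint URL and the Data Cube property names are assumptions; the actual graph layout is defined by the pilot's data storage schema.

    # Sketch of a pre-defined aggregation query against the pilot's SPARQL endpoint.
    # The endpoint URL and Data Cube property names are illustrative assumptions.
    from SPARQLWrapper import SPARQLWrapper, JSON

    ENDPOINT = "http://localhost:8080/sparql"  # hypothetical 4store endpoint

    QUERY = """
    PREFIX qb: <http://purl.org/linked-data/cube#>
    PREFIX ex: <http://example.org/budget#>

    SELECT ?municipality (SUM(?amount) AS ?total)
    WHERE {
      ?obs a qb:Observation ;
           ex:municipality ?municipality ;
           ex:executedAmount ?amount .
    }
    GROUP BY ?municipality
    """

    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(QUERY)
    sparql.setReturnFormat(JSON)

    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["municipality"]["value"], row["total"]["value"])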

7.4 Deployment

Table 12 lists the components provided to the pilot as part of the BDI and the components that will be developed within WP6 in the context of executing the pilot.

Table 12: Components needed to deploy the Second SC6 Pilot

Module: Spark over HDFS, 4store, Flume, Kafka
Task: BDI dockers made available by WP4
Responsible: FH, TF, InfAI, NCSR-D, SWC

Module: Data storage schema
Task: To be extended for the pilot
Responsible: SWC

Module: Metadata extraction
Task: Parsers for different data sources will be developed for the pilot
Responsible: SWC

Module: GraphSearch GUI
Task: To be configured for the pilot
Responsible: SWC

8 Second SC7 Pilot Deployment

8.1 Use cases

The pilot is carried out by SatCen, UoA and NCSR-D in the frame of SC7: Secure societies - Protecting freedom and security of Europe and its citizens.

The pilot demonstrates the following workflows:

1. Event detection workflow: news sites and social media are monitored and processed in order to extract and localize information about events. Events are categorized and the information from them is extracted; the end user is notified about the area affected by the news and can visualize the event information together with the changes detected by the other workflow (if activated).

2. Change detection workflow: the end user selects a relevant Area of Interest. With respect to the selected dates, two satellite images (earliest and latest) of this area are downloaded from the ESA Sentinels Scientific Data Hub and processed in order to detect changes. The end user is notified about detected changes and can view the images and the event information about this area.

The second cycle of the SC7 pilot will extend the functionality and improve the performance of the first cycle of the pilot (cf. D5.2, Section 8).

Apart from the datasets used in the first cycle of the pilot, this cycle will also use the keyword-based Twitter API to retrieve tweets based on pre-defined keywords. To further support the keyword-based search, the second cycle of the pilot will also include a full-text indexing engine.

The following outputs are made available for visualization or further processing:

- Relevant news related to specific keywords, together with the corresponding Area of Interest
- Detected changes

Moreover, the event detection workflow will be extended in order to automatically activate the change detection workflow. These changes are depicted in the updated architecture diagram in Figure 7.

8.2 Requirements

Table 13 lists the ingestion, storage, processing and output requirements set by the second cycle of the pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 13 lists only the new requirements.

Table 13: Requirements of the Second SC7 Pilot

R1: Monitor keyword-based text services (Twitter). Text is retrieved and stored together with provenance and any metadata provided by the service (notably location).
Comment: The NOMAD data connectors to Twitter and Reuters will be adapted to access the keyword search API of Twitter and store to Cassandra.

R2: Regularly execute event detection using Spark over the most recent text batch.
Comment: Event detection is part of the ingestion process and adds annotations to the text data; it is not part of the distributed processing.

R3: Improve the speed of the change detection workflow.
Comment: Optimize the scalability of the operators developed in Apache Spark for the change detection workflow.

R4: Extend the change detection workflow to improve accuracy.
Comment: Fundamental SNAP operators (e.g. Subset and Terrain Correction) for Sentinel-1 will be adapted to Apache Spark.

R5: Areas of Interest are automatically defined by event detection.
Comment: The Sentinel data connector is parametrized from the event detection module with a GIS shape.

R6: The end-user interface is based on Sextant.
Comment: Improvement of Sextant functionalities to improve the user experience.

R7: Users must be authenticated and authorized to access the pilot data.
Comment: Sextant will be extended in order to support authentication and authorization.

Figure 7: Architecture of the Second SC7 Pilot

8.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures:
- HDFS for storing satellite images
- Cassandra for storing news and tweets (content and metadata)
- Lucene for storing the GADM dataset, i.e. the administrative areas together with their geo-locations
- Strabon for storing geo-locations of detected changes and location metadata about news and tweets

Processing infrastructures:
- Spark will be made available for improving the change detection module and developing the event detection module

Data integration:
- Semagrow will federate Strabon and Cassandra to provide the user interface with homogeneous access to both data stores

Other modules:
- Twitter data connector (see the sketch after this list)
- Reuters RSS feed reader
- The Sentinel Data Aggregator, which receives as input the set of areas of interest and submits a suitable query to the Sentinels Scientific Data Hub
- Sextant as the user interface
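To give a flavour of the keyword-based Twitter connector and the Cassandra storage described in R1, a minimal sketch is shown below. The keyspace, table and column names are assumptions, the credentials are elided, and the actual connector is an adaptation of the NOMAD code base rather than the script shown here.

    # Minimal sketch of a keyword-based Twitter connector that stores tweets,
    # provenance and location metadata in Cassandra.
    # Keyspace, table and column names are assumptions; credentials are elided.
    import tweepy
    from cassandra.cluster import Cluster

    KEYWORDS = ["flood", "earthquake"]  # pre-defined keywords (illustrative)

    session = Cluster(["127.0.0.1"]).connect("sc7_pilot")
    insert = session.prepare(
        "INSERT INTO tweets (id, text, created_at, source, location) VALUES (?, ?, ?, ?, ?)"
    )

    client = tweepy.Client(bearer_token="...")  # credentials elided
    response = client.search_recent_tweets(
        query=" OR ".join(KEYWORDS),
        tweet_fields=["created_at", "geo"],
        max_results=100,
    )

    for tweet in response.data or []:
        session.execute(
            insert,
            (tweet.id, tweet.text, tweet.created_at, "twitter", str(tweet.geo)),
        )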

8.4 Deployment

Table 14 lists the components provided to the pilot as part of the BDI16 and the components that will be developed within WP6 in the context of executing the pilot.

Table 14: Components needed to deploy the Second SC7 Pilot

Module: Big Data Integrator (HDFS/Hadoop, Cassandra, Spark, Semagrow, Strabon, SOLR)
Task: BDI dockers made available by WP4
Responsible: FH, TF, InfAI, NCSR-D, UoA, SWC

Module: Cassandra and Strabon stores
Task: The schema needs to be altered to support tweets by keyword
Responsible: NCSR-D and UoA

Module: Change detection module
Task: Spark code to be developed for extending and improving the change detection algorithm
Responsible: UoA

Module: Event detection module
Task: Spark code to be developed to scale the event detection algorithm
Responsible: NCSR-D

Module: Twitter data connector
Task: To be extended to access the keyword search Twitter API
Responsible: NCSR-D

Module: User interface
Task: To be enhanced for the pilot
Responsible: UoA

16 Cf. https://github.com/big-data-europe/README/wiki/Components


9 Conclusions

This report analysed the pilot requirements and specified the components of the generic Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting round. The relevant work in this task is to ensure that the components are within the scope of what is prepared in WP4, and that they interoperate and can be used in the same application.

All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure and provided to the piloting partners as a basis for their piloting applications, which will be developed in WP6. As a result of this preliminary testing and of the interaction between the technical partners and the piloting partners, some of the original pilot descriptions have been refined and fully specified, and their usage of BDI components has been clarified. This ensures that the pilot descriptions are consistent with the first public release of the BDI platform (D4.2) and can be reproduced by interested third parties.

Work in this task (Task 5.2) will proceed as follows:

- During the second pilot deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.
- During the third pilot deployment phase, work in this task will prepare the next version of this document regarding the BDI instances needed for the third piloting round.

Page 14: Data Technologies · consulting about the platform. This task includes two phases: the design and the deployment phase. The design phase involves the following. Review the pilot descriptions

D54 ndash v 100

Page

14

3 Second SC2 Pilot Deployment

31 Overview

The pilot is carried out by AK FAO and SWC in the frame of SC2 Food Security Sustainable

Agriculture and Forestry Marine Maritime and Inland Water Research and the Bioeconomy

The second pilot cycle builds upon the first pilot cycle (cf D51 Section 3) expanding the

relevant data sources and extending the data processing needed to handle a variety of data

types (apart from bibliographic data) relevant to Viticulture

The pilot demonstrates the following workflows

1 Text mining workflow Automatically annotating scientific publications by (a) extracting

named entities (locations domain terms) and (b) extracting the captions of images

figures and tables The extracted information is provided to viticultural researchers via

a GUI that exposes search functionality

2 Data processing workflow The end users (viticultural researchers) upload scientific

data in a variety of formats and provide the metadata needed in order to correctly

interpret the data The data is ingested and homogenized so that it can be compared

and connected with other relevant data originally in diverse formats The data is

exposed to viticultural researchers via a GUI that exposes searchdiscovery

aggregation analysis correlation and visualization functionalities over structured data

The results of the data analysis will be stored in the infrastructure to avoid carrying out

the same processing multiple times with appropriate provence for future reference

publication and scientific replication

3 Phenologic modeling workflow that is the scheduling of agricultural operations (eg

pruning harvesting etc) by cross-examining the weather data observed in the area of

the vineyard with the appropriate weather conditions needed for the aforementioned

operations

4 Variety identification workflow The end users complete an on-spot questionnaire

regarding the characteristics of a specific grape variety Together with the geolocation

of the questionnaire this information is used to identify a grape variety

The following datasets will be involved

The AGRIS and PubMed datasets that include scientific publications

Weather data available via publicly-available API such as AccuWeather

OpenWeatherMap Weather Underground

D54 ndash v 100

Page

15

User-generated data such as geotagged photos from leaves young shoots and grape

clusters ampelographic data SSR-marker data that will be provided by the VITIS

application

OIV Descriptor List2 for Grape Varieties and Vitis species

Crop Ontology

The following processing is carried out

Named entity extraction

Researcher affiliation extraction and verification

Variety identification

Phenologic modelling

PDF structure processing to associate tables and diagrams with captions

Data ingestion and homogenization

Aggregation analysis correlation over scientific data

The following outputs are made available for visualization or further processing

Structured information topics extracted from scientific publications

Metadata for dataset searching and discovery

Aggregation analysis correlation results

32 Requirements

Table 3 lists the ingestion storage processing and output requirements set by this pilot

Table 3 Requirements of the Second SC2 Pilot

Requirement Comment

R1 In case of failures during data

processing users should be able to

recover data produced earlier in the

workflow along with their lineage and

associated metadata

Processing modules should periodically

store intermediate results and their lineage

metadata When starting up processing

modules should check at the metadata

registry if intermediate results are available

R2 Extracting images and their captions

from scientific publications

To be developed for the pilot taking into

account R1

2 httpwwwoivinten

D54 ndash v 100

Page

16

R3 Extracting thematic annotations from

text in scientific publications

To be developed for the pilot taking into

account R1

R4 Extracting researcher affiliations from

the scientific publications

To be developed for the pilot taking into

account R1

R5 Variety identification To be developed for the pilot taking into

account R1

R6 Phenolic modeling To be developed for the pilot taking into

account R1

R5 Expose data and metadata in JSON

through a Web API

Data ingestion module should write JSON

documents in HDFS 4store should be

accessed via a SPARQL endpoint that

responds with results in JSON

Table 3 Requirements of the Second SC2 Pilot

D54 ndash v 100

Page

17

Figure 2 Architecture of the Second SC2 Pilot

Figure 2 Architecture of the Second SC2 Pilot

33 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing publication full-text and ingested datasets

A graph database for storing publication metadata (terms and named entities)

affiliation metadata (connections between researchers) weather metadata and VITIS

metadata

Processing infrastructures

Metadata extraction Spark or UnifiedViews3 are used to extract RDF metadata from

publication full-text These tools will react on Kafka messages Spark and UnifiedViews

will be evaluated for this task

3 Cf httpwwwunifiedviewseu

D54 ndash v 100

Page

18

PoolParty A SKOS Thesaurus4 will be used to consolidatetranslate (linkmap) the

terms in the ingested documents (eg bio terms locations and other named entities)

For this step the SWC PoolParty Semantic Suite5 will be used Additional enrichment

of the dataset will be explored eg via linking to DBpedia or other LOD sources

AKSTEM the process of discovering relations and associations between organizations

and people in the field of viticulture research

Phenolic Modeling algorithm already developed in AK VITIS will be adapted to work in

the context of an Apache Spark application

Variety Identification already developed in AK VITIS will be adapted to work in the

context of an Apache Spark application

Extraction of images and figures and their captions from publication PDFs

Data analysis which writes analysis results back into the infrastructure to be retrieved

for visualization Data analysis should accompany each write-back with appropriate

metadata that specify the processing lineage of the derived dataset Intermediate

results should also be written out (and described as such in the metadata) in order to

allow resuming processing after a failure

Other modules

Flume for publication ingestion For every source that will be ingested into the system

there will be a flume agent responsible for data ingestion and basic

modificationunification

Kafka as soon as a new record is available a Kafka message will be produced One

kafka consumer stores raw data into HDFS

34 Deployment

Table 4 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 4 Components needed to deploy the Second SC2 Pilot

Module Task Responsible

Spark over HDFS Flume

Kafka

BDI dockers made available by WP4 FH TF InfAI

SWC

4 Cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 5 Cf httpwwwpoolpartybiz

D54 ndash v 100

Page

19

GraphDB andor Neo4j

dockerization

To be investigated if the Docker

images provided by the official

systems6 are suitable for the pilot If

not will be altered for the pilot or use

an already dockerized triple store such

as Virtuoso or 4store

SWC

Flume agents for publication

ingestion and processing

To be developed for the pilot SWC

Flume agents for data

ingestion

To be extended for the pilot in order to

support the introduced datasets

(accuweather data user-generated

data)

SWC AK

Data storage schema To be developed for the pilot SWC AK

Phenolic modelling To be adapted from AK VITIS for the

pilot

AK

Spark AKSTEM To be adapted from AK STEM for the

pilot

AK

Variety Identification To be adapted from AK VITIS for the

pilot

AK

Table 4 Components needed to deploy the Second SC2 Pilot

6 httpsneo4jcomdeveloperdocker

D54 ndash v 100

Page

20

4 Second SC3 Pilot Deployment

41 Overview

The pilot is carried out by CRES in the frame of SC3 Secure Clean and Efficient Energy

The second pilot cycle extends the first pilot by adding additional online and offline data

analysis on raw data regarding Acoustic Emissions (AE) sensors and aggregated data such

as parametrics from continuous monitoring systems (CMS) The pilot demonstrates the

following workflow a developer in the field of wind energy enhances condition monitoring for

each unit in a wind farm by pooling together data from multiple units from the same farm (to

consider the cluster operation in total) and third party data (to perform correlated assessment)

The custom analysis modules created by the developer use both raw data that are transferred

offline to the processing cluster and condensed data streamed online at the same time order

that the event occurs

The following datasets are involved

Raw sensor and SCADA data from a given wind farm

Online stream data comprised of parametrics and statistics extracted from the raw

SCADA data

Raw sensor data from Acoustic Emissions module from a given wind farm

All data is in custom binary or ASCII formats ASCII files contain a metadata header and in

tabulated form the signal data (signal in columns time sequence in rows) All data is annotated

by location time and system id

The following processing is carried out

Near-real time execution of parametrized models to return operational statistics

warnings including correlation analysis of data across units

Weekly execution of operational statistics

Weekly execution of model parametrization

Weekly specific acoustic emissions DSP

The following outputs are made available for visualization or further processing

Operational statistics near-real time and weekly

Model parameters

D54 ndash v 100

Page

21

42 Requirements

Table 5 lists the ingestion storage processing and output requirements set by this pilot Since

the second cycle of the pilot extends the first pilot some requirements are identical and

therefore omitted from Table 5

Table 5 Requirements of Second SC3 Pilot

Requirement Comment

R1 The online data will be sent (via

OPC) from the intermediate

(local) processing level to BDI

A data connector must be developed that provides

for receiving OPC streams from an OPC-

compatible server

R2 The application should be able

to recover from short outages by

collecting the data transmitted

during the outage from the data

sources

An OPC data connector must be developed that

can retrieve the missing data collected at the

intermediate level from the distributed data

historian systems

R3 Near-realtime execution of

parametrized models to return

operational statistics including

correlation analysis of data

across units

The analysis software should write its results back

into a specified format and data model that is

appropriate input for further analysis

R4 The GUI supports database

querying and data visualization

for the analytics results

The GUI will be able to access files in the format

and data model

Table 5 Requirements of the Second SC3 Pilot

D54 ndash v 100

Page

22

Figure 3 Architecture of the Second SC3 Pilot

Figure 3 Architecture of the Second SC3 Pilot

43 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS that stores binary blobs each holding a temporal slice of the complete data The

slicing parameters are fixed and can be applied at data ingestion time

A Postgres relational database to store the warnings operational statistics and the

output of the analysis The schema will be defined in a later

A Kafka broker that will distribute the continuous stream of CMS to model execution

Processing infrastructures

D54 ndash v 100

Page

23

A processor that operates upon temporal slices of data

A Spark module that orchestrates the application of the processor on slices

A Spark streaming module that operates on the online data

Other modules

A data connector that offers an ingestion endpoint andor can retrieve from remote data

sources using the FTP protocol

A data connector that offers an ingestion endpoint that can retrieve an online stream

using OPC protocol and publish it to a Kafka topic

Data visualization that can visualize the data files stored in HDFS

44 Deployment

Table 6 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 6 Components needed to deploy the Second SC3 Pilot

Module Task Responsible

Spark HDFS Postgres

Kafka

BDI dockers made available by WP4 FH TF InfAI

NCSR-D SWC

Acoustic Emissions DSP To be developed for the pilot CRES

OPC Data connector To be developed for the pilot CRES

Data visualization To be extended for the pilot CRES

Table 6 Components needed to deploy the Second SC3 Pilot

D54 ndash v 100

Page

24

5 Second SC4 Pilot Deployment

51 Use cases

The pilot is carried out by FhG and CERTH in the frame of SC4 Smart Green and Integrated

Transport

The pilot demonstrates how to implement the workflow for ingesting processing and storing

stream and historical traffic data in a distributed environment The pilot demonstrates the

following workflows

The map matching of the Floating Car Data (FCD) stream that is generated by the taxi

fleet The FCD data that represents the position of cabs using latitude and longitude

coordinates must be map matched to the roads on which the cabs are driving in order

to infer the traffic conditions of the roads The map matching is done through an

algorithm using a geographical database and topological rules

The monitoring of the current traffic conditions that consumes the mapped FCD data

and infers the traffic conditions of the roads

The forecasting of future traffic conditions based on a model that is trained from

historical and real-time mapped FCD data

The second pilot is based upon the processing modules developed in the first pilot (cf D52

Section 5) namely the processing modules developed by CERTH to analyze traffic data

classify traffic conditions The second pilot will also develop the newly added workflow of the

traffic forecasting and model training that did not exist during the first pilot cycle

The data sources available for the pilot are

A near-real time stream Floating Car Data (FCD) generated by a fleet of 1200 taxis

containing information about the position speed and direction of the cabs

A historical database of recorded FCD data

A geographical database with information about the road network in Thessaloniki

The results of traffic monitoring and traffic forecasting are saved into a database for querying

statistics and visualizations

52 Requirements

Table 7 lists the ingestion storage processing and output requirements set by this pilot Since

the present pilot cycle is an extension of the first pilot the requirements of the first pilot also

apply Table 13 lists only the new requirements

D54 ndash v 100

Page

25

Table 7 Requirements of the Second SC4 Pilot

Requirement Comment

R1 The pilot will enable the

evaluation of the present and

future traffic conditions (eg

congestion) within temporal

windows

The FCD map matched data are used to determine

the current traffic condition and to make predictions

within different time windows

R2 The traffic predictions will be

saved in a database

Traffic condition and prediction will be used for

queries statistics evaluation of the quality of

predictions visualizations

R3 The pilot can be started in two

configurations single node (for

development and testing) and

cluster (production)

It must be possible to run all the pilot components

in one single node for development and testing

purposes The cluster configuration must provide

cluster of any components messaging system

(Kafka) processing modules (Flink Spark

TensorFlow) storage (Postgres)

Table 7 Requirements of the Second SC4 Pilot

D54 ndash v 100

Page

26

Figure 4 Architecture of the Second SC4 Pilot

Figure 4 Architecture of the Second SC4 Pilot

53 Architecture

The architecture of the pilot has been designed taking into consideration the data sources

mostly streams the processing steps needed and the information that needs to be computed

The pilot will ingest data from a near real-time FCD data stream from cabs and from historical

FCD data The FCD data needs to be preprocessed for map matching before being used for

classificationprediction

Apache Kafka will be used to distribute the computations as it provides a scalable fault

tolerant messaging system The processing of the data streams will be performed within

temporal windows Apache Flink will be used for the map matching algorithm in the same

manner as in the first cycle of the pilot Apache Spark or Tensorflow will be considered as a

platform to implement the traffic forecasting algorithm

The algorithms used for the map matching and classification will be provided using R as

it provides a good support for machine learning algorithms and because it is commonly used

D54 ndash v 100

Page

27

and well known by researchers at CERTH In order to use the R packages in a Flink application

developed in Java the pilot will connect to R server (via Rserve) Recurrent Neural Networks

will be used for the traffic forecasting module

The traffic conditions and prediction computation will be stored in a scalable fault tolerant

database such as Elasticsearch The storage system must support spatial and temporal

indexing

54 Deployment

Table 8 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 8 Components needed to deploy Second SC4 Pilot

Module Task Responsible

PostGIS Elasticsearch

Kafka Flink Spark

TensorFlow

BDI dockers made available by WP4 NCSR-D SWC

TF FhG

A Kafka producer for FCD

data stream (source URL)

and historical data (source

file system)

Develop a Kafka producer to collect

the FCD data as a stream from web

services and from the file system for

the historical data sets and send them

to a Kafka topic

FhG

Kafka brokers Install Kafka to provide a message

broker and the topics

SWC

A Spark application for traffic

forecasting and model

training

Develop a Spark application that

consumes FCD matched data from a

Kafka topic The application will train a

prediction model and write the traffic

predictions to ElasticSearch

FhG

D54 ndash v 100

Page

28

A Kafka consumer for storing

analysis results

Develop a Kafka consumer that stores

the result of the Traffic Classification

and prediction module

FhG

Table 8 Components needed to deploy the Second SC4 Pilot

6 Second SC5 Pilot Deployment

61 Use cases

The pilot is carried out by NCSR-D in the frame of SC5 Climate Action Environment Resource

Efficiency and Raw Materials

The pilot demonstrates the following workflow A (potentially hazardous) substance is released

in the atmosphere that results to increased readings in one or more monitoring stations The

user accesses a user interface provided by the pilot to define the locations of the monitoring

stations as well as a timeseries of the measured values (eg gamma dose rate) The platform

initiates

a weather matching algorithm that is a search for similarity of the current weather and

the pre-computed weather patterns as well as

a dispersion matching algorithm that is a search for similarity of the current substance

dispersion patterns with the precomputed ones

The weather patterns have been extracted in a pre-processing step by clustering weather

conditions recorded in the past while the substance dispersion patterns have been

precomputed by simulating different scenarios of substance release and weather conditions

The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request

The following datasets are involved

NetCDF files from the European Centre for Medium range Weather Forecasting

(ECMWF7)

GRIB files from National Oceanic and Atmospheric Administration (NOAA8)

The following processing will be carried out

The weather clustering algorithm that creates clusters of similar weather conditions

implemented using the BDI platform (see Section 63)

7 httpappsecmwfintdatasets 8 httpswwwncdcnoaagovdata-accessmodel-datamodel-datasetsglobal-forcast-system-gfs

D54 ndash v 100

Page

29

The WRF downscaling that takes as input a low resolution weather and creates a high

resolution weather

The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model

computes dispersion patterns given predominant weather conditions

The following outputs are made available for visualization or further processing

The dispersions produced by DIPCOT

The weather clusters produced by the weather clustering algorithm

62 Requirements

Table 9 lists the ingestion storage processing and output requirements set by this pilot

Table 9 Requirements of Second SC5 Pilot

Requirement Comment

R1 Provide a means of downloading

currentevaluation weather from

ECMWF or alternative services

Data connectorinterface needs to be developed

R2 ECMWF and NOAA datasets are

compatible with the WRF and

DIPCOT naming conventions

A preprocessing WPS normalization step will

perform the necessary transformations and

variable renamings needs to ensure compatibility

R3 Retrieve NetCDF files from HDFS

as input to the weather clustering

algorithm

R4 Dispersion matching will filter on

dispersion values

Relational database will provide indexes on

dispersion values for efficient dispersion search

R5 Dispersion visualization Weather and dispersion matching must produce

output compatible with Sextantrsquos input or Sextant

must be modified to support new input

Table 9 Requirements of the Second SC5 Pilot

D54 ndash v 100

Page

30

Figure 5 Architecture of the Second SC5 Pilot

Figure 5 Architecture of the Second SC5 Pilot

63 Architecture

To satisfy the requirements described above the following components will be deployed

Storage infrastructure

HDFS for storing NetCDF and GRIB files

Postgres for storing dispersions

Processing components

Scilearn-kit or TensorFlow to host the weather clustering algorithm

Other modules

ECMWF and NOAA data connectors

WPS normalization procedure

WRF downscaling component

DIPCOT atmospheric dispersion model

Weather and dispersion matching

Sextant for visualizing the dispersion layer

D54 ndash v 100

Page

31

64 Deployment

Table 10 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 10 Components needed to deploy the Second SC5 Pilot

Module Task Responsible

HDFS Sextant Postgres BDI dockers made available by WP4 TF UoA NCSR-D

Scikit-learn TensorFlow To be developed in the pilot NCSR-D

DIPCOT To be packaged in the pilot NCSR-D

Weather clustering algorithm To be developed in the pilot NCSR-D

Weather matching To be developed in the pilot NCSR-D

Dispersion matching To be developed in the pilot NCSR-D

ECMWF and NOAA data

connector

To be developed in the pilot NCSR-D

Data visualization UI To be developed in the pilot NCSR-D

Table 10 Components needed to deploy the Second SC5 Pilot

D54 ndash v 100

Page

32

7 Second SC6 Pilot Deployment

71 Use cases

The pilot is carried out by NCSR-D and SWC in the frame of SC6 Europe in a changing world

- inclusive innovative and reflective societies

The pilot demonstrates the following workflow Municipality economic data (ie budget and

budget execution data) are ingested at a regular basis (daily weekly and so on) from a series

of locations in a variety of structures and formats are homogenized so that they can be

compared analyzed and visualized in a comprehensible way The data is exposed to users

via a dashboard that exposes searchdiscovery aggregation analysis correlation and

visualization functionalities over structured data The results of the data analysis will be stored

in the infrastructure to avoid carrying out the same processing multiple times

The second cycle of the pilot will extend the first pilot by incorporating different formats by

developing a modular parsing library

The following datasets are involved

Budget execution data of Municipality of Athens

Budget execution data of Municipality of Thessaloniki

Budget execution data of Municipality of Barcelona

The current datasets involved are exposed either as an API or as CSV XML files

Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies

Statistical data will be described in the RDF DataCube12 vocabulary

The following processing is carried out

Data ingestion and homogenization

Aggregation analysis correlation over scientific data

The following outputs are made available for visualization or further processing

Structured information extracted from budget datasets exposed as a SPARQL endpoint

Metadata for dataset searching and discovery

9 Cf httpsjoinupeceuropaeuassetdcat_application_profiledescription 10 Cf httpwwwomgorgspecEDMC-FIBOFND10Beta1indexhtm 11 Cf httpwwwomgorghot-topicsfinancehtm 12 Cf httpswwww3orgTR2014REC-vocab-data-cube-20140116

D54 ndash v 100

Page

33

Aggregation and analysis

72 Requirements

Table 11 lists the ingestion storage processing and output requirements set by this pilot

Table 11 Requirements of the Second SC6 Pilot

Requirement Comment

R1 In case of failures during data

processing users should be able to

recover data produced earlier in the

workflow along with their lineage and

associated metadata

Processing modules should periodically

store intermediate results When starting

up processing modules should check at the

metadata registry if intermediate results are

available

R2 Transform budget data into a

homogenized format using various

parsers

Parsers will be developed for the pilot

taking into account R1

R3 Expose data and metadata through a

SPARQL endpoint

The triple store should be accessed via a

SPARQL endpoint

R4 Intuitive easy-to-use interface for

searching and selecting relevant data

sources The use of the user interface

should be documented so that users

can ease into using it with as little

effort as possible

The GraphSearch UI will be used to create

visualizations from SPARQL queries

Table 11 Requirements of the Second SC6 Pilot

D54 ndash v 100

Page

34

Figure 6 Architecture of the Second SC6 Pilot

Figure 6 Architecture of the Second SC6 Pilot

73 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing ingested datasets

4store for storing homogenized statistical data and dataset metadata

Processing infrastructures

Metadata extraction Spark is used to extract RDF data and metadata from budget

data These tools will react on Kafka messages

PoolParty A SKOS Thesaurus13 will be used to consolidatetranslate (linkmap) the

terms in the ingested documents (eg bio terms locations and other named entities)

For this step the SWC PoolParty Semantic Suite14 will be used as an external service

13 Please cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 14 Please cf httpwwwpoolpartybiz

D54 ndash v 100

Page

35

PoolParty is accessible from the BDE components via an HTTP API The connection

between Spark and PoolParty has been implemented in the first pilot cycle Additional

enrichment of the dataset will be explored eg via linking to DBpedia or other LOD

sources

Data analysis that will be performed on demand by pre-defined queries in the

dashboard

Other modules

Flume for dataset ingestion For every source that will be ingested into the system there

will be a flume agent responsible for data ingestion and basic modificationunification

Kafka as soon as a new record is available a Kafka message will be produced One

kafka consumer stores raw data into HDFS

A set of pre-defined SPARQL queries that carry out analytical aggregations important

comparisons and or other analysis of the data

GUI that provide functionality for (a) metadata searching to discover datasets data and

publications (b) linked data browsing (ie dereferencing entity descriptions in RDF) in

the form of a visual dashboard realised in d3js15

GraphSearch as the user interface

74 Deployment

Table 12 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 12 Components needed to deploy the Second SC6 Pilot

Module Task Responsible

Spark over HDFS 4store

Flume Kafka

BDI dockers made available by WP4 FH TF InfAI

NCSR-D SWC

Data storage schema To be extended for the pilot SWC

Metadata extraction Parsers for different data sources will

be developed for the pilot

SWC

15 Cf httpsd3jsorg

D54 ndash v 100

Page

36

GraphSearch GUI To be configured for the pilot SWC

Table 12 Components needed to deploy the Second SC6 Pilot

8 Second SC7 Pilot Deployment

81 Use cases

The pilot is carried out by SatCen UoA and NCSR-D in the frame of SC7 Secure societies ndash

Protecting freedom and security of Europe and its citizens

The pilot demonstrates the following workflows

1 Event detection workflow News sites and social media are monitored and processed

in order to extract and localize information about events Events are categorized and

the information from them is extracted the end-user is notified about the area interested

by the news and can visualize the events information together with the changes

detected by the other workflow (if activated)

2 Change detection workflow The end user selects a relevant Area of Interest With

respect to the selected dates two satellite images (earliest and latest) of these areas

are downloaded from ESA Sentinels Scientific Data Hub and processed in order to

detect changes The end-user is notified about detected changes and can view the

images and event information about this area

The second cycle of the SC7 pilot will extend the functionality and improve the performance of

the first cycle of the pilot (cf D52 Section 8)

Apart from the datasets used in the first cycle of the pilot this cycle will also use the keyword-

based Twitter API to retrieve tweets based on pre-defined keywords To further support the

keyword-based search the second cycle of the pilot will also include a full-text indexing engine

The following outputs are made available for visualization or further processing

Relevant news related to specific keywords together with the corresponding Area of

Interested

Detected changes

Moreover the event detection workflow will be extended in order to automatically activate the

change detection workflow These changes are depicted in the updated architecture diagram

in Figure 7

D54 ndash v 100

Page

37

82 Requirements

Table 13 lists the ingestion storage processing and output requirements set by the second

cycle of the pilot Since the present pilot cycle is an extension of the first pilot the requirements

of the first pilot also apply Table 13 lists only the new requirements

Table 13 Requirements of the Second SC7 Pilot

Requirement Comment

R1 Monitor keyword-based text services

(Twitter) Text is retrieved and stored

together with provenance and any

metadata provided by the service

(notably location)

The NOMAD data connectors to Twitter

and Reuters will be adapted to access the

keyword search API of Twitter and store to

Cassandra

R2 Regularly execute event detection

using Spark over the most recent text

batch

Event detection is part of the ingestion

process and adds annotations to the text

data not part of the distributed processing

R3 Improve the speed of the change

detection workflow

Optimize the scalability of the operators

developed in Apache Spark for the change

detection workflow

R4 Extend change detection workflow to

improve accuracy

Fundamental SNAP operators (eg Subset

and Terrain Correction) for Sentinel 1 will be

adapted to Apache Spark

R5 Areas of Interest are automatically

defined by event detection

The Sentinel data connector is

parametrized from the event detection

module with a GIS shape

R6 End-user interface is based on Sextant Improvement of Sextant functionalities to

improve the user experience

D54 ndash v 100

Page

38

R7 Users must be authenticated and

authorized to access the pilot data

Sextant will be extended in order to support

authentication and authorization

Table 13 Requirements of the Second SC7 Pilot

Figure 7 Architecture of the Second SC7 Pilot

Figure 7 Architecture of the Second SC7 Pilot

83 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing satellite images

Cassandra for storing news and tweets content and metadata

Lucene for storing GADM dataset ie the administrative areas together with their geo-

locations

D54 ndash v 100

Page

39

Strabon for storing geo-locations of detected changes and location metadata about

news and tweets

Processing infrastructures

Spark will be made available for improving the change detection module and

developing the event detection module

Data integration

Semagrow will federate Strabon and Cassandra to provide the user interface with

homogeneous access to both data stores

Other modules

Twitter data connector

Reuters RSS feed reader

The Sentinel Data Aggregator receives as input the set of areas of interest and submits

a suitable query to the Sentinels Scientific Data Hub

Sextant as the user interface

84 Deployment

Table 14 lists the components provided to the pilot as part of BDI16 and components that will

be developed within WP6 in the context of executing the pilot

Table 14 Components needed to deploy the Second SC7 Pilot

Module Task Responsible

Big Data Integrator

HDFSHadoop Cassandra

Spark Semagrow Strabon

SOLR

BDI dockers made available by WP4 FH TF InfAI

NCSR-D UoA

SwC

Cassandra and Strabon

stores

The schema needs to be altered to

support tweets by keyword

NCSR-D and

UoA

Change detection module Spark code to be developed for UoA

16 Cf httpsgithubcombig-data-europeREADMEwikiComponents

D54 ndash v 100

Page

40

extending and improving the change

detection algorithm

Event Detection module Spark code to be developed to scale

the event detection algorithm

NCSR-D

Twitter data connector To be extended to access the keyword

search Twitter API

NCSR-D

User interface To be enhanced for the pilot UoA

Table 14 Components needed to deploy the Second SC7 Pilot

D54 ndash v 100

Page

41

9 Conclusions This report analysed the pilot requirements and specifies the components of the the generic

Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting

round The relevant work in this task is to ensure that the components are within the scope

of what is prepared in WP4 and that they interoperate and can be used in the same

application

All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure

and provided to the piloting partners as a basis for their piloting applications which will be

developed in WP6 As a result of this preliminary testing and the interaction between the

technical partners and the piloting partners some of the original pilot descriptions have

been refined and fully specified and their usage of BDI components has been clarified This

ensures that the pilot descriptions are consistent with the first public release of the BDI

platform (D42) and can be reproduced by interested third parties

Work in this task (Task 52) will proceed as follows

During the second pilot deployment phase work in this task will follow and document

development of the individual components and test their integration into the platform

During the third pilot deployment phase work in this task will prepare the next version

of this document regarding the BDI instances needed for the third piloting round

Page 15: Data Technologies · consulting about the platform. This task includes two phases: the design and the deployment phase. The design phase involves the following. Review the pilot descriptions

D54 ndash v 100

Page

15

User-generated data such as geotagged photos from leaves young shoots and grape

clusters ampelographic data SSR-marker data that will be provided by the VITIS

application

OIV Descriptor List2 for Grape Varieties and Vitis species

Crop Ontology

The following processing is carried out

Named entity extraction

Researcher affiliation extraction and verification

Variety identification

Phenologic modelling

PDF structure processing to associate tables and diagrams with captions

Data ingestion and homogenization

Aggregation analysis correlation over scientific data

The following outputs are made available for visualization or further processing

Structured information topics extracted from scientific publications

Metadata for dataset searching and discovery

Aggregation analysis correlation results

32 Requirements

Table 3 lists the ingestion storage processing and output requirements set by this pilot

Table 3 Requirements of the Second SC2 Pilot

Requirement Comment

R1 In case of failures during data

processing users should be able to

recover data produced earlier in the

workflow along with their lineage and

associated metadata

Processing modules should periodically

store intermediate results and their lineage

metadata When starting up processing

modules should check at the metadata

registry if intermediate results are available

R2 Extracting images and their captions

from scientific publications

To be developed for the pilot taking into

account R1

2 httpwwwoivinten

D54 ndash v 100

Page

16

R3 Extracting thematic annotations from

text in scientific publications

To be developed for the pilot taking into

account R1

R4 Extracting researcher affiliations from

the scientific publications

To be developed for the pilot taking into

account R1

R5 Variety identification To be developed for the pilot taking into

account R1

R6 Phenolic modeling To be developed for the pilot taking into

account R1

R5 Expose data and metadata in JSON

through a Web API

Data ingestion module should write JSON

documents in HDFS 4store should be

accessed via a SPARQL endpoint that

responds with results in JSON

Table 3 Requirements of the Second SC2 Pilot

D54 ndash v 100

Page

17

Figure 2 Architecture of the Second SC2 Pilot

Figure 2 Architecture of the Second SC2 Pilot

33 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing publication full-text and ingested datasets

A graph database for storing publication metadata (terms and named entities)

affiliation metadata (connections between researchers) weather metadata and VITIS

metadata

Processing infrastructures

Metadata extraction Spark or UnifiedViews3 are used to extract RDF metadata from

publication full-text These tools will react on Kafka messages Spark and UnifiedViews

will be evaluated for this task

3 Cf httpwwwunifiedviewseu

D54 ndash v 100

Page

18

PoolParty A SKOS Thesaurus4 will be used to consolidatetranslate (linkmap) the

terms in the ingested documents (eg bio terms locations and other named entities)

For this step the SWC PoolParty Semantic Suite5 will be used Additional enrichment

of the dataset will be explored eg via linking to DBpedia or other LOD sources

AKSTEM the process of discovering relations and associations between organizations

and people in the field of viticulture research

Phenolic Modeling algorithm already developed in AK VITIS will be adapted to work in

the context of an Apache Spark application

Variety Identification already developed in AK VITIS will be adapted to work in the

context of an Apache Spark application

Extraction of images and figures and their captions from publication PDFs

Data analysis which writes analysis results back into the infrastructure to be retrieved

for visualization Data analysis should accompany each write-back with appropriate

metadata that specify the processing lineage of the derived dataset Intermediate

results should also be written out (and described as such in the metadata) in order to

allow resuming processing after a failure

Other modules:
- Flume for publication ingestion: for every source that will be ingested into the system, there will be a Flume agent responsible for data ingestion and basic modification/unification.
- Kafka: as soon as a new record is available, a Kafka message will be produced. One Kafka consumer stores raw data into HDFS, as sketched below.
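The following is a minimal sketch of that Kafka consumer, archiving every ingested record to HDFS; it assumes kafka-python and the hdfscli WebHDFS client, and the topic name, broker address, and target paths are placeholders for the pilot deployment.

```python
# A minimal sketch of the Kafka consumer that archives raw ingested records to
# HDFS, assuming kafka-python and the hdfscli WebHDFS client. The topic name,
# broker address, and target paths are placeholders.
import json

from hdfs import InsecureClient   # pip install hdfs
from kafka import KafkaConsumer   # pip install kafka-python

hdfs_client = InsecureClient("http://namenode:50070", user="bde")
consumer = KafkaConsumer(
    "sc2.ingested.records",              # hypothetical topic fed by the Flume agents
    bootstrap_servers="kafka:9092",
    group_id="raw-archiver",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    # One file per record, keyed by partition and offset so writes never collide.
    path = f"/pilots/sc2/raw/{message.topic}/{message.partition}-{message.offset}.json"
    hdfs_client.write(path, data=json.dumps(message.value), encoding="utf-8")
```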

3.4 Deployment

Table 4 lists the components provided to the pilot as part of BDI and the components that will be developed within WP6 in the context of executing the pilot.

Table 4: Components needed to deploy the Second SC2 Pilot

- Module: Spark over HDFS, Flume, Kafka. Task: BDI dockers made available by WP4. Responsible: FH, TF, InfAI, SWC.
- Module: GraphDB and/or Neo4j dockerization. Task: To be investigated whether the Docker images provided by the official systems [6] are suitable for the pilot; if not, they will be adapted for the pilot, or an already dockerized triple store such as Virtuoso or 4store will be used. Responsible: SWC.
- Module: Flume agents for publication ingestion and processing. Task: To be developed for the pilot. Responsible: SWC.
- Module: Flume agents for data ingestion. Task: To be extended for the pilot in order to support the introduced datasets (AccuWeather data, user-generated data). Responsible: SWC, AK.
- Module: Data storage schema. Task: To be developed for the pilot. Responsible: SWC, AK.
- Module: Phenolic modelling. Task: To be adapted from AK VITIS for the pilot. Responsible: AK.
- Module: Spark AKSTEM. Task: To be adapted from AK STEM for the pilot. Responsible: AK.
- Module: Variety Identification. Task: To be adapted from AK VITIS for the pilot. Responsible: AK.

[6] https://neo4j.com/developer/docker

4 Second SC3 Pilot Deployment

4.1 Overview

The pilot is carried out by CRES in the frame of SC3, Secure, Clean and Efficient Energy.

The second pilot cycle extends the first pilot by adding online and offline data analysis on raw data from Acoustic Emissions (AE) sensors and on aggregated data, such as parametrics from continuous monitoring systems (CMS). The pilot demonstrates the following workflow: a developer in the field of wind energy enhances condition monitoring for each unit in a wind farm by pooling together data from multiple units of the same farm (to consider the cluster operation in total) and third-party data (to perform correlated assessment). The custom analysis modules created by the developer use both raw data that are transferred offline to the processing cluster and condensed data streamed online, in the same time order in which the events occur.

The following datasets are involved:
- Raw sensor and SCADA data from a given wind farm.
- Online stream data comprised of parametrics and statistics extracted from the raw SCADA data.
- Raw sensor data from the Acoustic Emissions module of a given wind farm.

All data is in custom binary or ASCII formats. ASCII files contain a metadata header and, in tabulated form, the signal data (signals in columns, time sequence in rows). All data is annotated by location, time, and system id.

The following processing is carried out:
- Near-real-time execution of parametrized models to return operational statistics and warnings, including correlation analysis of data across units.
- Weekly execution of operational statistics.
- Weekly execution of model parametrization.
- Weekly execution of specific acoustic emissions DSP (digital signal processing).

The following outputs are made available for visualization or further processing:
- Operational statistics, near-real-time and weekly.
- Model parameters.


4.2 Requirements

Table 5 lists the ingestion, storage, processing, and output requirements set by this pilot. Since the second cycle of the pilot extends the first pilot, some requirements are identical and are therefore omitted from Table 5.

Table 5: Requirements of the Second SC3 Pilot

R1: The online data will be sent (via OPC) from the intermediate (local) processing level to BDI.
    Comment: A data connector must be developed that provides for receiving OPC streams from an OPC-compatible server.

R2: The application should be able to recover from short outages by collecting the data transmitted during the outage from the data sources.
    Comment: An OPC data connector must be developed that can retrieve the missing data collected at the intermediate level from the distributed data historian systems.

R3: Near-real-time execution of parametrized models to return operational statistics, including correlation analysis of data across units.
    Comment: The analysis software should write its results back in a specified format and data model that is appropriate input for further analysis.

R4: The GUI supports database querying and data visualization for the analytics results.
    Comment: The GUI will be able to access files in that format and data model.


Figure 3: Architecture of the Second SC3 Pilot

4.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures:
- HDFS, which stores binary blobs, each holding a temporal slice of the complete data. The slicing parameters are fixed and can be applied at data ingestion time.
- A Postgres relational database to store the warnings, operational statistics, and the output of the analysis. The schema will be defined at a later stage.
- A Kafka broker that will distribute the continuous stream of CMS data to model execution.

Processing infrastructures:
- A processor that operates upon temporal slices of data.
- A Spark module that orchestrates the application of the processor on slices.
- A Spark streaming module that operates on the online data (see the sketch after this list).

Other modules:
- A data connector that offers an ingestion endpoint and/or can retrieve from remote data sources using the FTP protocol.
- A data connector that offers an ingestion endpoint that can retrieve an online stream using the OPC protocol and publish it to a Kafka topic.
- Data visualization that can visualize the data files stored in HDFS.
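The following is a minimal sketch of how the Spark streaming module could consume the online CMS stream distributed by the Kafka broker; the topic name, record schema, and window length are assumptions, and the pilot would write the resulting statistics to the Postgres database rather than to the console.

```python
# A minimal sketch of the Spark streaming module consuming the online CMS
# parametrics from the Kafka broker (requires the spark-sql-kafka package on
# the cluster). Topic name, record schema, and window length are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (DoubleType, StringType, StructField, StructType,
                               TimestampType)

spark = SparkSession.builder.appName("sc3-online-cms").getOrCreate()

schema = StructType([
    StructField("unit_id", StringType()),      # wind farm unit identifier
    StructField("timestamp", TimestampType()),
    StructField("parametric", StringType()),   # name of the CMS parametric
    StructField("value", DoubleType()),
])

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")
          .option("subscribe", "sc3.cms.parametrics")   # hypothetical topic
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
          .select("r.*"))

# Near-real-time operational statistics per unit over ten-minute windows (R3).
stats = (stream
         .withWatermark("timestamp", "10 minutes")
         .groupBy(F.window("timestamp", "10 minutes"), "unit_id", "parametric")
         .agg(F.avg("value").alias("mean"), F.max("value").alias("max")))

# The pilot would write these statistics to Postgres; the console sink is used
# here only to keep the sketch self-contained.
query = stats.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```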

4.4 Deployment

Table 6 lists the components provided to the pilot as part of BDI and the components that will be developed within WP6 in the context of executing the pilot.

Table 6: Components needed to deploy the Second SC3 Pilot

- Module: Spark, HDFS, Postgres, Kafka. Task: BDI dockers made available by WP4. Responsible: FH, TF, InfAI, NCSR-D, SWC.
- Module: Acoustic Emissions DSP. Task: To be developed for the pilot. Responsible: CRES.
- Module: OPC data connector. Task: To be developed for the pilot. Responsible: CRES.
- Module: Data visualization. Task: To be extended for the pilot. Responsible: CRES.


5 Second SC4 Pilot Deployment

5.1 Use cases

The pilot is carried out by FhG and CERTH in the frame of SC4, Smart, Green and Integrated Transport.

The pilot demonstrates how to implement the workflow for ingesting, processing, and storing stream and historical traffic data in a distributed environment. The pilot demonstrates the following workflows:
- The map matching of the Floating Car Data (FCD) stream that is generated by the taxi fleet. The FCD data, which represent the positions of cabs using latitude and longitude coordinates, must be map matched to the roads on which the cabs are driving in order to infer the traffic conditions of the roads. The map matching is done through an algorithm using a geographical database and topological rules.
- The monitoring of the current traffic conditions, which consumes the map-matched FCD data and infers the traffic conditions of the roads.
- The forecasting of future traffic conditions, based on a model that is trained from historical and real-time map-matched FCD data.

The second pilot is based upon the processing modules developed in the first pilot (cf. D5.2, Section 5), namely the processing modules developed by CERTH to analyze traffic data and classify traffic conditions. The second pilot will also develop the newly added workflow of traffic forecasting and model training that did not exist during the first pilot cycle.

The data sources available for the pilot are:
- A near-real-time stream of Floating Car Data (FCD) generated by a fleet of 1200 taxis, containing information about the position, speed, and direction of the cabs.
- A historical database of recorded FCD data.
- A geographical database with information about the road network in Thessaloniki.

The results of traffic monitoring and traffic forecasting are saved into a database for querying, statistics, and visualizations.

5.2 Requirements

Table 7 lists the ingestion, storage, processing, and output requirements set by this pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 7 lists only the new requirements.

Table 7: Requirements of the Second SC4 Pilot

R1: The pilot will enable the evaluation of the present and future traffic conditions (e.g. congestion) within temporal windows.
    Comment: The map-matched FCD data are used to determine the current traffic condition and to make predictions within different time windows.

R2: The traffic predictions will be saved in a database.
    Comment: Traffic conditions and predictions will be used for queries, statistics, evaluation of the quality of predictions, and visualizations.

R3: The pilot can be started in two configurations: single node (for development and testing) and cluster (production).
    Comment: It must be possible to run all the pilot components on one single node for development and testing purposes. The cluster configuration must provide clustering of all components: messaging system (Kafka), processing modules (Flink, Spark, TensorFlow), and storage (Postgres).


Figure 4: Architecture of the Second SC4 Pilot

5.3 Architecture

The architecture of the pilot has been designed taking into consideration the data sources (mostly streams), the processing steps needed, and the information that needs to be computed. The pilot will ingest data from a near-real-time FCD data stream from cabs and from historical FCD data. The FCD data needs to be preprocessed for map matching before being used for classification/prediction.

Apache Kafka will be used to distribute the computations, as it provides a scalable, fault-tolerant messaging system. The processing of the data streams will be performed within temporal windows. Apache Flink will be used for the map matching algorithm, in the same manner as in the first cycle of the pilot. Apache Spark or TensorFlow will be considered as a platform to implement the traffic forecasting algorithm.
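As an illustration of how FCD records would enter this pipeline, the following is a minimal sketch of a Kafka producer (cf. the corresponding component in Table 8), assuming kafka-python; the web-service URL, topic name, and polling interval are placeholders.

```python
# A minimal sketch of a Kafka producer pushing FCD records to a topic
# (cf. the corresponding component in Table 8), assuming kafka-python.
# The web-service URL, topic name, and polling interval are placeholders.
import json
import time

import requests                  # pip install requests
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    # Hypothetical web service exposing the latest FCD batch as a JSON list.
    records = requests.get("http://fcd-service.example.org/latest").json()
    for record in records:
        producer.send("sc4.fcd.raw", record)
    producer.flush()
    time.sleep(60)  # near-real-time polling interval
```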

The algorithms used for the map matching and classification will be provided using R, as it provides good support for machine learning algorithms and because it is commonly used and well known by the researchers at CERTH. In order to use the R packages in a Flink application developed in Java, the pilot will connect to an R server (via Rserve). Recurrent Neural Networks will be used for the traffic forecasting module.

The traffic conditions and prediction computations will be stored in a scalable, fault-tolerant database such as Elasticsearch. The storage system must support spatial and temporal indexing.
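The following minimal sketch illustrates the shape of such a Spark application: it consumes map-matched FCD records from a Kafka topic, aggregates them per road segment and time window, and indexes the results into Elasticsearch. The topic and index names are placeholders, and the simple average stands in for the pilot's actual forecasting model (e.g. the recurrent neural network mentioned above).

```python
# A minimal sketch of the forecasting application: it consumes map-matched FCD
# records from Kafka, aggregates them per road segment and time window, and
# indexes the results into Elasticsearch via foreachBatch. The simple average
# stands in for the pilot's actual prediction model; topic and index names are
# placeholders.
from elasticsearch import Elasticsearch  # pip install elasticsearch
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("sc4-traffic-forecast").getOrCreate()

schema = StructType([
    StructField("road_id", StringType()),
    StructField("timestamp", TimestampType()),
    StructField("speed", DoubleType()),
])

fcd = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "kafka:9092")
       .option("subscribe", "sc4.fcd.matched")   # hypothetical topic of map-matched FCD
       .load()
       .select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
       .select("r.*"))

windowed = (fcd.withWatermark("timestamp", "15 minutes")
            .groupBy(F.window("timestamp", "15 minutes"), "road_id")
            .agg(F.avg("speed").alias("mean_speed")))


def index_batch(batch_df, batch_id):
    """Index one micro-batch of per-window aggregates into Elasticsearch."""
    es = Elasticsearch(["http://elasticsearch:9200"])
    for row in batch_df.collect():  # per-window batches are small
        es.index(index="traffic-predictions", body={
            "road_id": row["road_id"],
            "window_start": row["window"]["start"].isoformat(),
            "mean_speed": row["mean_speed"],
        })


query = windowed.writeStream.outputMode("update").foreachBatch(index_batch).start()
query.awaitTermination()
```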

5.4 Deployment

Table 8 lists the components provided to the pilot as part of BDI and the components that will be developed within WP6 in the context of executing the pilot.

Table 8: Components needed to deploy the Second SC4 Pilot

- Module: PostGIS, Elasticsearch, Kafka, Flink, Spark, TensorFlow. Task: BDI dockers made available by WP4. Responsible: NCSR-D, SWC, TF, FhG.
- Module: A Kafka producer for the FCD data stream (source: URL) and historical data (source: file system). Task: Develop a Kafka producer to collect the FCD data as a stream from web services and from the file system for the historical data sets, and send them to a Kafka topic. Responsible: FhG.
- Module: Kafka brokers. Task: Install Kafka to provide a message broker and the topics. Responsible: SWC.
- Module: A Spark application for traffic forecasting and model training. Task: Develop a Spark application that consumes map-matched FCD data from a Kafka topic. The application will train a prediction model and write the traffic predictions to Elasticsearch. Responsible: FhG.
- Module: A Kafka consumer for storing analysis results. Task: Develop a Kafka consumer that stores the results of the traffic classification and prediction modules. Responsible: FhG.

6 Second SC5 Pilot Deployment

6.1 Use cases

The pilot is carried out by NCSR-D in the frame of SC5, Climate Action, Environment, Resource Efficiency and Raw Materials.

The pilot demonstrates the following workflow: a (potentially hazardous) substance is released in the atmosphere, resulting in increased readings at one or more monitoring stations. The user accesses a user interface provided by the pilot to define the locations of the monitoring stations as well as a timeseries of the measured values (e.g. gamma dose rate). The platform initiates:
- a weather matching algorithm, that is, a search for similarity between the current weather and the pre-computed weather patterns, as well as
- a dispersion matching algorithm, that is, a search for similarity between the current substance dispersion patterns and the precomputed ones.

The weather patterns have been extracted in a pre-processing step by clustering weather conditions recorded in the past, while the substance dispersion patterns have been precomputed by simulating different scenarios of substance release and weather conditions. The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request.

The following datasets are involved:
- NetCDF files from the European Centre for Medium-range Weather Forecasting (ECMWF [7]).
- GRIB files from the National Oceanic and Atmospheric Administration (NOAA [8]).

The following processing will be carried out:
- The weather clustering algorithm that creates clusters of similar weather conditions, implemented using the BDI platform (see Section 6.3).
- The WRF downscaling, which takes as input a low-resolution weather field and creates a high-resolution one.
- The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model, which computes dispersion patterns given predominant weather conditions.

The following outputs are made available for visualization or further processing:
- The dispersions produced by DIPCOT.
- The weather clusters produced by the weather clustering algorithm.

[7] http://apps.ecmwf.int/datasets
[8] https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/global-forcast-system-gfs

6.2 Requirements

Table 9 lists the ingestion, storage, processing, and output requirements set by this pilot.

Table 9: Requirements of the Second SC5 Pilot

R1: Provide a means of downloading current/evaluation weather from ECMWF or alternative services.
    Comment: A data connector/interface needs to be developed.

R2: ECMWF and NOAA datasets are compatible with the WRF and DIPCOT naming conventions.
    Comment: A preprocessing WPS normalization step will perform the necessary transformations and variable renamings needed to ensure compatibility.

R3: Retrieve NetCDF files from HDFS as input to the weather clustering algorithm.

R4: Dispersion matching will filter on dispersion values.
    Comment: A relational database will provide indexes on dispersion values for efficient dispersion search.

R5: Dispersion visualization.
    Comment: Weather and dispersion matching must produce output compatible with Sextant's input, or Sextant must be modified to support the new input.


Figure 5: Architecture of the Second SC5 Pilot

6.3 Architecture

To satisfy the requirements described above, the following components will be deployed.

Storage infrastructure:
- HDFS for storing NetCDF and GRIB files.
- Postgres for storing dispersions.

Processing components:
- Scikit-learn or TensorFlow to host the weather clustering algorithm (see the sketch after this list).

Other modules:
- ECMWF and NOAA data connectors.
- WPS normalization procedure.
- WRF downscaling component.
- DIPCOT atmospheric dispersion model.
- Weather and dispersion matching.
- Sextant for visualizing the dispersion layer.
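As an illustration of the weather clustering step referenced in the list above, the following is a minimal sketch using scikit-learn's KMeans over feature vectors extracted from NetCDF files; the variable names, local file location, and number of clusters are assumptions, not pilot specifications.

```python
# A minimal sketch of the weather clustering step, using scikit-learn's KMeans
# over feature vectors extracted from NetCDF files. The variable names, local
# file location, and number of clusters are assumptions; the files are assumed
# to have been copied from HDFS to local scratch space beforehand (R3), and to
# share the same grid.
import glob

import numpy as np
from netCDF4 import Dataset          # pip install netCDF4
from sklearn.cluster import KMeans   # pip install scikit-learn


def weather_vector(path):
    """Flatten one weather snapshot into a single feature vector."""
    with Dataset(path) as nc:
        u = np.asarray(nc.variables["u10"][:]).ravel()   # 10 m wind, u component
        v = np.asarray(nc.variables["v10"][:]).ravel()   # 10 m wind, v component
        t = np.asarray(nc.variables["t2m"][:]).ravel()   # 2 m temperature
    return np.concatenate([u, v, t])


files = sorted(glob.glob("/scratch/sc5/netcdf/*.nc"))
features = np.vstack([weather_vector(f) for f in files])

kmeans = KMeans(n_clusters=8, random_state=0).fit(features)
for path, label in zip(files, kmeans.labels_):
    print(label, path)   # cluster id assigned to each historical weather snapshot
```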


6.4 Deployment

Table 10 lists the components provided to the pilot as part of BDI and the components that will be developed within WP6 in the context of executing the pilot.

Table 10: Components needed to deploy the Second SC5 Pilot

- Module: HDFS, Sextant, Postgres. Task: BDI dockers made available by WP4. Responsible: TF, UoA, NCSR-D.
- Module: Scikit-learn, TensorFlow. Task: To be developed in the pilot. Responsible: NCSR-D.
- Module: DIPCOT. Task: To be packaged in the pilot. Responsible: NCSR-D.
- Module: Weather clustering algorithm. Task: To be developed in the pilot. Responsible: NCSR-D.
- Module: Weather matching. Task: To be developed in the pilot. Responsible: NCSR-D.
- Module: Dispersion matching. Task: To be developed in the pilot. Responsible: NCSR-D.
- Module: ECMWF and NOAA data connector. Task: To be developed in the pilot. Responsible: NCSR-D.
- Module: Data visualization UI. Task: To be developed in the pilot. Responsible: NCSR-D.


7 Second SC6 Pilot Deployment

7.1 Use cases

The pilot is carried out by NCSR-D and SWC in the frame of SC6, Europe in a changing world - inclusive, innovative and reflective societies.

The pilot demonstrates the following workflow: municipality economic data (i.e. budget and budget execution data) are ingested on a regular basis (daily, weekly, and so on) from a series of locations, in a variety of structures and formats, and are homogenized so that they can be compared, analyzed, and visualized in a comprehensible way. The data is exposed to users via a dashboard that offers search/discovery, aggregation, analysis, correlation, and visualization functionalities over structured data. The results of the data analysis will be stored in the infrastructure to avoid carrying out the same processing multiple times.

The second cycle of the pilot will extend the first pilot by incorporating different formats, through the development of a modular parsing library.

The following datasets are involved:
- Budget execution data of the Municipality of Athens.
- Budget execution data of the Municipality of Thessaloniki.
- Budget execution data of the Municipality of Barcelona.

The current datasets involved are exposed either as an API or as CSV/XML files. Datasets will be described by DCAT-AP [9] metadata and the FIBO [10] and FIGI [11] ontologies. Statistical data will be described in the RDF Data Cube [12] vocabulary.

The following processing is carried out:
- Data ingestion and homogenization.
- Aggregation, analysis, and correlation over the ingested data.

The following outputs are made available for visualization or further processing:
- Structured information extracted from budget datasets, exposed as a SPARQL endpoint.
- Metadata for dataset searching and discovery.
- Aggregation and analysis results.

[9] Cf. https://joinup.ec.europa.eu/asset/dcat_application_profile/description
[10] Cf. http://www.omg.org/spec/EDMC-FIBO/FND/1.0/Beta1/index.htm
[11] Cf. http://www.omg.org/hot-topics/finance.htm
[12] Cf. https://www.w3.org/TR/2014/REC-vocab-data-cube-20140116/

7.2 Requirements

Table 11 lists the ingestion, storage, processing, and output requirements set by this pilot.

Table 11: Requirements of the Second SC6 Pilot

R1: In case of failures during data processing, users should be able to recover data produced earlier in the workflow, along with their lineage and associated metadata.
    Comment: Processing modules should periodically store intermediate results. When starting up, processing modules should check at the metadata registry whether intermediate results are available.

R2: Transform budget data into a homogenized format using various parsers.
    Comment: Parsers will be developed for the pilot, taking into account R1.

R3: Expose data and metadata through a SPARQL endpoint.
    Comment: The triple store should be accessed via a SPARQL endpoint.

R4: Intuitive, easy-to-use interface for searching and selecting relevant data sources. The use of the user interface should be documented so that users can ease into using it with as little effort as possible.
    Comment: The GraphSearch UI will be used to create visualizations from SPARQL queries.


Figure 6: Architecture of the Second SC6 Pilot

7.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures:
- HDFS for storing ingested datasets.
- 4store for storing homogenized statistical data and dataset metadata.

Processing infrastructures:
- Metadata extraction: Spark is used to extract RDF data and metadata from budget data. These tools will react on Kafka messages.
- PoolParty: a SKOS Thesaurus [13] will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g. bio terms, locations, and other named entities). For this step the SWC PoolParty Semantic Suite [14] will be used as an external service. PoolParty is accessible from the BDE components via an HTTP API; the connection between Spark and PoolParty has been implemented in the first pilot cycle. Additional enrichment of the dataset will be explored, e.g. via linking to DBpedia or other LOD sources.
- Data analysis, which will be performed on demand by pre-defined queries in the dashboard.

Other modules:
- Flume for dataset ingestion: for every source that will be ingested into the system, there will be a Flume agent responsible for data ingestion and basic modification/unification.
- Kafka: as soon as a new record is available, a Kafka message will be produced. One Kafka consumer stores raw data into HDFS.
- A set of pre-defined SPARQL queries that carry out analytical aggregations, important comparisons, and/or other analysis of the data (see the sketch after this list).
- A GUI that provides functionality for (a) metadata searching to discover datasets, data, and publications, and (b) linked data browsing (i.e. dereferencing entity descriptions in RDF), in the form of a visual dashboard realised in d3.js [15].
- GraphSearch as the user interface.

[13] Cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System
[14] Cf. http://www.poolparty.biz
[15] Cf. https://d3js.org
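As an illustration of the pre-defined SPARQL queries referenced in the list above, the following minimal sketch runs one analytical aggregation against the pilot's 4store SPARQL endpoint using SPARQLWrapper; the endpoint URL and the budget amount property are placeholders for the actual data storage schema.

```python
# A minimal sketch of one pre-defined analytical SPARQL query run against the
# pilot's 4store endpoint using SPARQLWrapper. The endpoint URL and the budget
# amount property are placeholders for the actual data storage schema.
from SPARQLWrapper import JSON, SPARQLWrapper  # pip install SPARQLWrapper

sparql = SPARQLWrapper("http://4store:8000/sparql/")  # hypothetical endpoint

# Total budgeted amount per RDF Data Cube dataset (i.e. per municipality dataset).
sparql.setQuery("""
PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT ?dataset (SUM(?amount) AS ?total)
WHERE {
  ?obs a qb:Observation ;
       qb:dataSet ?dataset ;
       <http://example.org/budget#amount> ?amount .
}
GROUP BY ?dataset
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["dataset"]["value"], row["total"]["value"])
```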

7.4 Deployment

Table 12 lists the components provided to the pilot as part of BDI and the components that will be developed within WP6 in the context of executing the pilot.

Table 12: Components needed to deploy the Second SC6 Pilot

- Module: Spark over HDFS, 4store, Flume, Kafka. Task: BDI dockers made available by WP4. Responsible: FH, TF, InfAI, NCSR-D, SWC.
- Module: Data storage schema. Task: To be extended for the pilot. Responsible: SWC.
- Module: Metadata extraction. Task: Parsers for different data sources will be developed for the pilot. Responsible: SWC.
- Module: GraphSearch GUI. Task: To be configured for the pilot. Responsible: SWC.

8 Second SC7 Pilot Deployment

8.1 Use cases

The pilot is carried out by SatCen, UoA, and NCSR-D in the frame of SC7, Secure societies – Protecting freedom and security of Europe and its citizens.

The pilot demonstrates the following workflows:
1. Event detection workflow: news sites and social media are monitored and processed in order to extract and localize information about events. Events are categorized and the information from them is extracted; the end-user is notified about the area concerned by the news and can visualize the event information together with the changes detected by the other workflow (if activated).
2. Change detection workflow: the end user selects a relevant Area of Interest. With respect to the selected dates, two satellite images (earliest and latest) of these areas are downloaded from the ESA Sentinels Scientific Data Hub and processed in order to detect changes. The end-user is notified about detected changes and can view the images and event information about this area.

The second cycle of the SC7 pilot will extend the functionality and improve the performance of the first cycle of the pilot (cf. D5.2, Section 8).

Apart from the datasets used in the first cycle of the pilot, this cycle will also use the keyword-based Twitter API to retrieve tweets based on pre-defined keywords. To further support the keyword-based search, the second cycle of the pilot will also include a full-text indexing engine.

The following outputs are made available for visualization or further processing:
- Relevant news related to specific keywords, together with the corresponding Area of Interest.
- Detected changes.

Moreover, the event detection workflow will be extended in order to automatically activate the change detection workflow. These changes are depicted in the updated architecture diagram in Figure 7.


8.2 Requirements

Table 13 lists the ingestion, storage, processing, and output requirements set by the second cycle of the pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 13 lists only the new requirements.

Table 13: Requirements of the Second SC7 Pilot

R1: Monitor keyword-based text services (Twitter). Text is retrieved and stored together with provenance and any metadata provided by the service (notably location).
    Comment: The NOMAD data connectors to Twitter and Reuters will be adapted to access the keyword search API of Twitter and store to Cassandra.

R2: Regularly execute event detection using Spark over the most recent text batch.
    Comment: Event detection is part of the ingestion process and adds annotations to the text data; it is not part of the distributed processing.

R3: Improve the speed of the change detection workflow.
    Comment: Optimize the scalability of the operators developed in Apache Spark for the change detection workflow.

R4: Extend the change detection workflow to improve accuracy.
    Comment: Fundamental SNAP operators (e.g. Subset and Terrain Correction) for Sentinel 1 will be adapted to Apache Spark.

R5: Areas of Interest are automatically defined by event detection.
    Comment: The Sentinel data connector is parametrized from the event detection module with a GIS shape.

R6: The end-user interface is based on Sextant.
    Comment: Improvement of Sextant functionalities to improve the user experience.

R7: Users must be authenticated and authorized to access the pilot data.
    Comment: Sextant will be extended in order to support authentication and authorization.

Figure 7: Architecture of the Second SC7 Pilot

8.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures:
- HDFS for storing satellite images.
- Cassandra for storing news and tweet content and metadata.
- Lucene for storing the GADM dataset, i.e. the administrative areas together with their geo-locations.
- Strabon for storing geo-locations of detected changes and location metadata about news and tweets.

Processing infrastructures:
- Spark will be made available for improving the change detection module and developing the event detection module.

Data integration:
- Semagrow will federate Strabon and Cassandra to provide the user interface with homogeneous access to both data stores.

Other modules:
- Twitter data connector (see the sketch after this list).
- Reuters RSS feed reader.
- The Sentinel Data Aggregator, which receives as input the set of areas of interest and submits a suitable query to the Sentinels Scientific Data Hub.
- Sextant as the user interface.
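As an illustration of the storage side of the keyword-based Twitter connector (R1), the following minimal sketch writes retrieved tweets to Cassandra together with their provenance metadata, using the cassandra-driver; the keyspace, table, and column names are placeholders for the schema to be defined for the pilot (cf. Table 14), and the tweet dict is assumed to be already normalized by the connector.

```python
# A minimal sketch of the storage side of the keyword-based Twitter connector
# (R1): tweets retrieved via the keyword search API are written to Cassandra
# together with their provenance. The keyspace, table, and column names are
# placeholders for the schema to be defined for the pilot (cf. Table 14), and
# the tweet dict is assumed to be already normalized by the connector.
from cassandra.cluster import Cluster  # pip install cassandra-driver

cluster = Cluster(["cassandra"])        # hypothetical contact point
session = cluster.connect("sc7")        # hypothetical keyspace

insert = session.prepare("""
    INSERT INTO tweets_by_keyword (keyword, tweet_id, created_at, text, source, location)
    VALUES (?, ?, ?, ?, ?, ?)
""")


def store_tweet(keyword, tweet):
    """Persist one tweet together with its provenance metadata."""
    session.execute(insert, (
        keyword,
        tweet["id"],
        tweet["created_at"],
        tweet["text"],
        "twitter-keyword-search",       # provenance: which service produced the record
        tweet.get("location"),
    ))
```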

8.4 Deployment

Table 14 lists the components provided to the pilot as part of BDI [16] and the components that will be developed within WP6 in the context of executing the pilot.

Table 14: Components needed to deploy the Second SC7 Pilot

- Module: Big Data Integrator (HDFS/Hadoop, Cassandra, Spark, Semagrow, Strabon, SOLR). Task: BDI dockers made available by WP4. Responsible: FH, TF, InfAI, NCSR-D, UoA, SWC.
- Module: Cassandra and Strabon stores. Task: The schema needs to be altered to support tweets by keyword. Responsible: NCSR-D and UoA.
- Module: Change detection module. Task: Spark code to be developed for extending and improving the change detection algorithm. Responsible: UoA.
- Module: Event detection module. Task: Spark code to be developed to scale the event detection algorithm. Responsible: NCSR-D.
- Module: Twitter data connector. Task: To be extended to access the keyword search Twitter API. Responsible: NCSR-D.
- Module: User interface. Task: To be enhanced for the pilot. Responsible: UoA.

[16] Cf. https://github.com/big-data-europe/README/wiki/Components


9 Conclusions

This report analysed the pilot requirements and specified the components of the generic Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting round. The relevant work in this task is to ensure that the components are within the scope of what is prepared in WP4, and that they interoperate and can be used in the same application.

All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure and provided to the piloting partners as a basis for their piloting applications, which will be developed in WP6. As a result of this preliminary testing and the interaction between the technical partners and the piloting partners, some of the original pilot descriptions have been refined and fully specified, and their usage of BDI components has been clarified. This ensures that the pilot descriptions are consistent with the first public release of the BDI platform (D4.2) and can be reproduced by interested third parties.

Work in this task (Task 5.2) will proceed as follows:
- During the second pilot deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.
- During the third pilot deployment phase, work in this task will prepare the next version of this document, regarding the BDI instances needed for the third piloting round.

Page 16: Data Technologies · consulting about the platform. This task includes two phases: the design and the deployment phase. The design phase involves the following. Review the pilot descriptions

D54 ndash v 100

Page

16

R3 Extracting thematic annotations from

text in scientific publications

To be developed for the pilot taking into

account R1

R4 Extracting researcher affiliations from

the scientific publications

To be developed for the pilot taking into

account R1

R5 Variety identification To be developed for the pilot taking into

account R1

R6 Phenolic modeling To be developed for the pilot taking into

account R1

R5 Expose data and metadata in JSON

through a Web API

Data ingestion module should write JSON

documents in HDFS 4store should be

accessed via a SPARQL endpoint that

responds with results in JSON

Table 3 Requirements of the Second SC2 Pilot

D54 ndash v 100

Page

17

Figure 2 Architecture of the Second SC2 Pilot

Figure 2 Architecture of the Second SC2 Pilot

33 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing publication full-text and ingested datasets

A graph database for storing publication metadata (terms and named entities)

affiliation metadata (connections between researchers) weather metadata and VITIS

metadata

Processing infrastructures

Metadata extraction Spark or UnifiedViews3 are used to extract RDF metadata from

publication full-text These tools will react on Kafka messages Spark and UnifiedViews

will be evaluated for this task

3 Cf httpwwwunifiedviewseu

D54 ndash v 100

Page

18

PoolParty A SKOS Thesaurus4 will be used to consolidatetranslate (linkmap) the

terms in the ingested documents (eg bio terms locations and other named entities)

For this step the SWC PoolParty Semantic Suite5 will be used Additional enrichment

of the dataset will be explored eg via linking to DBpedia or other LOD sources

AKSTEM the process of discovering relations and associations between organizations

and people in the field of viticulture research

Phenolic Modeling algorithm already developed in AK VITIS will be adapted to work in

the context of an Apache Spark application

Variety Identification already developed in AK VITIS will be adapted to work in the

context of an Apache Spark application

Extraction of images and figures and their captions from publication PDFs

Data analysis which writes analysis results back into the infrastructure to be retrieved

for visualization Data analysis should accompany each write-back with appropriate

metadata that specify the processing lineage of the derived dataset Intermediate

results should also be written out (and described as such in the metadata) in order to

allow resuming processing after a failure

Other modules

Flume for publication ingestion For every source that will be ingested into the system

there will be a flume agent responsible for data ingestion and basic

modificationunification

Kafka as soon as a new record is available a Kafka message will be produced One

kafka consumer stores raw data into HDFS

34 Deployment

Table 4 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 4 Components needed to deploy the Second SC2 Pilot

Module Task Responsible

Spark over HDFS Flume

Kafka

BDI dockers made available by WP4 FH TF InfAI

SWC

4 Cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 5 Cf httpwwwpoolpartybiz

D54 ndash v 100

Page

19

GraphDB andor Neo4j

dockerization

To be investigated if the Docker

images provided by the official

systems6 are suitable for the pilot If

not will be altered for the pilot or use

an already dockerized triple store such

as Virtuoso or 4store

SWC

Flume agents for publication

ingestion and processing

To be developed for the pilot SWC

Flume agents for data

ingestion

To be extended for the pilot in order to

support the introduced datasets

(accuweather data user-generated

data)

SWC AK

Data storage schema To be developed for the pilot SWC AK

Phenolic modelling To be adapted from AK VITIS for the

pilot

AK

Spark AKSTEM To be adapted from AK STEM for the

pilot

AK

Variety Identification To be adapted from AK VITIS for the

pilot

AK

Table 4 Components needed to deploy the Second SC2 Pilot

6 httpsneo4jcomdeveloperdocker

D54 ndash v 100

Page

20

4 Second SC3 Pilot Deployment

41 Overview

The pilot is carried out by CRES in the frame of SC3 Secure Clean and Efficient Energy

The second pilot cycle extends the first pilot by adding additional online and offline data

analysis on raw data regarding Acoustic Emissions (AE) sensors and aggregated data such

as parametrics from continuous monitoring systems (CMS) The pilot demonstrates the

following workflow a developer in the field of wind energy enhances condition monitoring for

each unit in a wind farm by pooling together data from multiple units from the same farm (to

consider the cluster operation in total) and third party data (to perform correlated assessment)

The custom analysis modules created by the developer use both raw data that are transferred

offline to the processing cluster and condensed data streamed online at the same time order

that the event occurs

The following datasets are involved

Raw sensor and SCADA data from a given wind farm

Online stream data comprised of parametrics and statistics extracted from the raw

SCADA data

Raw sensor data from Acoustic Emissions module from a given wind farm

All data is in custom binary or ASCII formats ASCII files contain a metadata header and in

tabulated form the signal data (signal in columns time sequence in rows) All data is annotated

by location time and system id

The following processing is carried out

Near-real time execution of parametrized models to return operational statistics

warnings including correlation analysis of data across units

Weekly execution of operational statistics

Weekly execution of model parametrization

Weekly specific acoustic emissions DSP

The following outputs are made available for visualization or further processing

Operational statistics near-real time and weekly

Model parameters

D54 ndash v 100

Page

21

42 Requirements

Table 5 lists the ingestion storage processing and output requirements set by this pilot Since

the second cycle of the pilot extends the first pilot some requirements are identical and

therefore omitted from Table 5

Table 5 Requirements of Second SC3 Pilot

Requirement Comment

R1 The online data will be sent (via

OPC) from the intermediate

(local) processing level to BDI

A data connector must be developed that provides

for receiving OPC streams from an OPC-

compatible server

R2 The application should be able

to recover from short outages by

collecting the data transmitted

during the outage from the data

sources

An OPC data connector must be developed that

can retrieve the missing data collected at the

intermediate level from the distributed data

historian systems

R3 Near-realtime execution of

parametrized models to return

operational statistics including

correlation analysis of data

across units

The analysis software should write its results back

into a specified format and data model that is

appropriate input for further analysis

R4 The GUI supports database

querying and data visualization

for the analytics results

The GUI will be able to access files in the format

and data model

Table 5 Requirements of the Second SC3 Pilot

D54 ndash v 100

Page

22

Figure 3 Architecture of the Second SC3 Pilot

Figure 3 Architecture of the Second SC3 Pilot

43 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS that stores binary blobs each holding a temporal slice of the complete data The

slicing parameters are fixed and can be applied at data ingestion time

A Postgres relational database to store the warnings operational statistics and the

output of the analysis The schema will be defined in a later

A Kafka broker that will distribute the continuous stream of CMS to model execution

Processing infrastructures

D54 ndash v 100

Page

23

A processor that operates upon temporal slices of data

A Spark module that orchestrates the application of the processor on slices

A Spark streaming module that operates on the online data

Other modules

A data connector that offers an ingestion endpoint andor can retrieve from remote data

sources using the FTP protocol

A data connector that offers an ingestion endpoint that can retrieve an online stream

using OPC protocol and publish it to a Kafka topic

Data visualization that can visualize the data files stored in HDFS

44 Deployment

Table 6 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 6 Components needed to deploy the Second SC3 Pilot

Module Task Responsible

Spark HDFS Postgres

Kafka

BDI dockers made available by WP4 FH TF InfAI

NCSR-D SWC

Acoustic Emissions DSP To be developed for the pilot CRES

OPC Data connector To be developed for the pilot CRES

Data visualization To be extended for the pilot CRES

Table 6 Components needed to deploy the Second SC3 Pilot

D54 ndash v 100

Page

24

5 Second SC4 Pilot Deployment

51 Use cases

The pilot is carried out by FhG and CERTH in the frame of SC4 Smart Green and Integrated

Transport

The pilot demonstrates how to implement the workflow for ingesting processing and storing

stream and historical traffic data in a distributed environment The pilot demonstrates the

following workflows

The map matching of the Floating Car Data (FCD) stream that is generated by the taxi

fleet The FCD data that represents the position of cabs using latitude and longitude

coordinates must be map matched to the roads on which the cabs are driving in order

to infer the traffic conditions of the roads The map matching is done through an

algorithm using a geographical database and topological rules

The monitoring of the current traffic conditions that consumes the mapped FCD data

and infers the traffic conditions of the roads

The forecasting of future traffic conditions based on a model that is trained from

historical and real-time mapped FCD data

The second pilot is based upon the processing modules developed in the first pilot (cf D52

Section 5) namely the processing modules developed by CERTH to analyze traffic data

classify traffic conditions The second pilot will also develop the newly added workflow of the

traffic forecasting and model training that did not exist during the first pilot cycle

The data sources available for the pilot are

A near-real time stream Floating Car Data (FCD) generated by a fleet of 1200 taxis

containing information about the position speed and direction of the cabs

A historical database of recorded FCD data

A geographical database with information about the road network in Thessaloniki

The results of traffic monitoring and traffic forecasting are saved into a database for querying

statistics and visualizations

52 Requirements

Table 7 lists the ingestion storage processing and output requirements set by this pilot Since

the present pilot cycle is an extension of the first pilot the requirements of the first pilot also

apply Table 13 lists only the new requirements

D54 ndash v 100

Page

25

Table 7 Requirements of the Second SC4 Pilot

Requirement Comment

R1 The pilot will enable the

evaluation of the present and

future traffic conditions (eg

congestion) within temporal

windows

The FCD map matched data are used to determine

the current traffic condition and to make predictions

within different time windows

R2 The traffic predictions will be

saved in a database

Traffic condition and prediction will be used for

queries statistics evaluation of the quality of

predictions visualizations

R3 The pilot can be started in two

configurations single node (for

development and testing) and

cluster (production)

It must be possible to run all the pilot components

in one single node for development and testing

purposes The cluster configuration must provide

cluster of any components messaging system

(Kafka) processing modules (Flink Spark

TensorFlow) storage (Postgres)

Table 7 Requirements of the Second SC4 Pilot

D54 ndash v 100

Page

26

Figure 4 Architecture of the Second SC4 Pilot

Figure 4 Architecture of the Second SC4 Pilot

53 Architecture

The architecture of the pilot has been designed taking into consideration the data sources

mostly streams the processing steps needed and the information that needs to be computed

The pilot will ingest data from a near real-time FCD data stream from cabs and from historical

FCD data The FCD data needs to be preprocessed for map matching before being used for

classificationprediction

Apache Kafka will be used to distribute the computations as it provides a scalable fault

tolerant messaging system The processing of the data streams will be performed within

temporal windows Apache Flink will be used for the map matching algorithm in the same

manner as in the first cycle of the pilot Apache Spark or Tensorflow will be considered as a

platform to implement the traffic forecasting algorithm

The algorithms used for the map matching and classification will be provided using R as

it provides a good support for machine learning algorithms and because it is commonly used

D54 ndash v 100

Page

27

and well known by researchers at CERTH In order to use the R packages in a Flink application

developed in Java the pilot will connect to R server (via Rserve) Recurrent Neural Networks

will be used for the traffic forecasting module

The traffic conditions and prediction computation will be stored in a scalable fault tolerant

database such as Elasticsearch The storage system must support spatial and temporal

indexing

54 Deployment

Table 8 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 8 Components needed to deploy Second SC4 Pilot

Module Task Responsible

PostGIS Elasticsearch

Kafka Flink Spark

TensorFlow

BDI dockers made available by WP4 NCSR-D SWC

TF FhG

A Kafka producer for FCD

data stream (source URL)

and historical data (source

file system)

Develop a Kafka producer to collect

the FCD data as a stream from web

services and from the file system for

the historical data sets and send them

to a Kafka topic

FhG

Kafka brokers Install Kafka to provide a message

broker and the topics

SWC

A Spark application for traffic

forecasting and model

training

Develop a Spark application that

consumes FCD matched data from a

Kafka topic The application will train a

prediction model and write the traffic

predictions to ElasticSearch

FhG

D54 ndash v 100

Page

28

A Kafka consumer for storing

analysis results

Develop a Kafka consumer that stores

the result of the Traffic Classification

and prediction module

FhG

Table 8 Components needed to deploy the Second SC4 Pilot

6 Second SC5 Pilot Deployment

61 Use cases

The pilot is carried out by NCSR-D in the frame of SC5 Climate Action Environment Resource

Efficiency and Raw Materials

The pilot demonstrates the following workflow A (potentially hazardous) substance is released

in the atmosphere that results to increased readings in one or more monitoring stations The

user accesses a user interface provided by the pilot to define the locations of the monitoring

stations as well as a timeseries of the measured values (eg gamma dose rate) The platform

initiates

a weather matching algorithm that is a search for similarity of the current weather and

the pre-computed weather patterns as well as

a dispersion matching algorithm that is a search for similarity of the current substance

dispersion patterns with the precomputed ones

The weather patterns have been extracted in a pre-processing step by clustering weather

conditions recorded in the past while the substance dispersion patterns have been

precomputed by simulating different scenarios of substance release and weather conditions

The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request

The following datasets are involved

NetCDF files from the European Centre for Medium range Weather Forecasting

(ECMWF7)

GRIB files from National Oceanic and Atmospheric Administration (NOAA8)

The following processing will be carried out

The weather clustering algorithm that creates clusters of similar weather conditions

implemented using the BDI platform (see Section 63)

7 httpappsecmwfintdatasets 8 httpswwwncdcnoaagovdata-accessmodel-datamodel-datasetsglobal-forcast-system-gfs

D54 ndash v 100

Page

29

The WRF downscaling that takes as input a low resolution weather and creates a high

resolution weather

The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model

computes dispersion patterns given predominant weather conditions

The following outputs are made available for visualization or further processing

The dispersions produced by DIPCOT

The weather clusters produced by the weather clustering algorithm

62 Requirements

Table 9 lists the ingestion storage processing and output requirements set by this pilot

Table 9 Requirements of Second SC5 Pilot

Requirement Comment

R1 Provide a means of downloading

currentevaluation weather from

ECMWF or alternative services

Data connectorinterface needs to be developed

R2 ECMWF and NOAA datasets are

compatible with the WRF and

DIPCOT naming conventions

A preprocessing WPS normalization step will

perform the necessary transformations and

variable renamings needs to ensure compatibility

R3 Retrieve NetCDF files from HDFS

as input to the weather clustering

algorithm

R4 Dispersion matching will filter on

dispersion values

Relational database will provide indexes on

dispersion values for efficient dispersion search

R5 Dispersion visualization Weather and dispersion matching must produce

output compatible with Sextantrsquos input or Sextant

must be modified to support new input

Table 9 Requirements of the Second SC5 Pilot

D54 ndash v 100

Page

30

Figure 5 Architecture of the Second SC5 Pilot

Figure 5 Architecture of the Second SC5 Pilot

63 Architecture

To satisfy the requirements described above the following components will be deployed

Storage infrastructure

HDFS for storing NetCDF and GRIB files

Postgres for storing dispersions

Processing components

Scilearn-kit or TensorFlow to host the weather clustering algorithm

Other modules

ECMWF and NOAA data connectors

WPS normalization procedure

WRF downscaling component

DIPCOT atmospheric dispersion model

Weather and dispersion matching

Sextant for visualizing the dispersion layer

D54 ndash v 100

Page

31

64 Deployment

Table 10 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 10 Components needed to deploy the Second SC5 Pilot

Module Task Responsible

HDFS Sextant Postgres BDI dockers made available by WP4 TF UoA NCSR-D

Scikit-learn TensorFlow To be developed in the pilot NCSR-D

DIPCOT To be packaged in the pilot NCSR-D

Weather clustering algorithm To be developed in the pilot NCSR-D

Weather matching To be developed in the pilot NCSR-D

Dispersion matching To be developed in the pilot NCSR-D

ECMWF and NOAA data

connector

To be developed in the pilot NCSR-D

Data visualization UI To be developed in the pilot NCSR-D

Table 10 Components needed to deploy the Second SC5 Pilot

D54 ndash v 100

Page

32

7 Second SC6 Pilot Deployment

71 Use cases

The pilot is carried out by NCSR-D and SWC in the frame of SC6 Europe in a changing world

- inclusive innovative and reflective societies

The pilot demonstrates the following workflow Municipality economic data (ie budget and

budget execution data) are ingested at a regular basis (daily weekly and so on) from a series

of locations in a variety of structures and formats are homogenized so that they can be

compared analyzed and visualized in a comprehensible way The data is exposed to users

via a dashboard that exposes searchdiscovery aggregation analysis correlation and

visualization functionalities over structured data The results of the data analysis will be stored

in the infrastructure to avoid carrying out the same processing multiple times

The second cycle of the pilot will extend the first pilot by incorporating different formats by

developing a modular parsing library

The following datasets are involved

Budget execution data of Municipality of Athens

Budget execution data of Municipality of Thessaloniki

Budget execution data of Municipality of Barcelona

The current datasets involved are exposed either as an API or as CSV XML files

Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies

Statistical data will be described in the RDF DataCube12 vocabulary

The following processing is carried out

Data ingestion and homogenization

Aggregation analysis correlation over scientific data

The following outputs are made available for visualization or further processing

Structured information extracted from budget datasets exposed as a SPARQL endpoint

Metadata for dataset searching and discovery

9 Cf httpsjoinupeceuropaeuassetdcat_application_profiledescription 10 Cf httpwwwomgorgspecEDMC-FIBOFND10Beta1indexhtm 11 Cf httpwwwomgorghot-topicsfinancehtm 12 Cf httpswwww3orgTR2014REC-vocab-data-cube-20140116

D54 ndash v 100

Page

33

Aggregation and analysis

72 Requirements

Table 11 lists the ingestion storage processing and output requirements set by this pilot

Table 11 Requirements of the Second SC6 Pilot

Requirement Comment

R1 In case of failures during data

processing users should be able to

recover data produced earlier in the

workflow along with their lineage and

associated metadata

Processing modules should periodically

store intermediate results When starting

up processing modules should check at the

metadata registry if intermediate results are

available

R2 Transform budget data into a

homogenized format using various

parsers

Parsers will be developed for the pilot

taking into account R1

R3 Expose data and metadata through a

SPARQL endpoint

The triple store should be accessed via a

SPARQL endpoint

R4 Intuitive easy-to-use interface for

searching and selecting relevant data

sources The use of the user interface

should be documented so that users

can ease into using it with as little

effort as possible

The GraphSearch UI will be used to create

visualizations from SPARQL queries

Table 11 Requirements of the Second SC6 Pilot

D54 ndash v 100

Page

34

Figure 6 Architecture of the Second SC6 Pilot

Figure 6 Architecture of the Second SC6 Pilot

73 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing ingested datasets

4store for storing homogenized statistical data and dataset metadata

Processing infrastructures

Metadata extraction Spark is used to extract RDF data and metadata from budget

data These tools will react on Kafka messages

PoolParty A SKOS Thesaurus13 will be used to consolidatetranslate (linkmap) the

terms in the ingested documents (eg bio terms locations and other named entities)

For this step the SWC PoolParty Semantic Suite14 will be used as an external service

13 Please cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 14 Please cf httpwwwpoolpartybiz

D54 ndash v 100

Page

35

PoolParty is accessible from the BDE components via an HTTP API The connection

between Spark and PoolParty has been implemented in the first pilot cycle Additional

enrichment of the dataset will be explored eg via linking to DBpedia or other LOD

sources

Data analysis that will be performed on demand by pre-defined queries in the

dashboard

Other modules

Flume for dataset ingestion For every source that will be ingested into the system there

will be a flume agent responsible for data ingestion and basic modificationunification

Kafka as soon as a new record is available a Kafka message will be produced One

kafka consumer stores raw data into HDFS

A set of pre-defined SPARQL queries that carry out analytical aggregations important

comparisons and or other analysis of the data

GUI that provide functionality for (a) metadata searching to discover datasets data and

publications (b) linked data browsing (ie dereferencing entity descriptions in RDF) in

the form of a visual dashboard realised in d3js15

GraphSearch as the user interface

74 Deployment

Table 12 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 12 Components needed to deploy the Second SC6 Pilot

Module Task Responsible

Spark over HDFS 4store

Flume Kafka

BDI dockers made available by WP4 FH TF InfAI

NCSR-D SWC

Data storage schema To be extended for the pilot SWC

Metadata extraction Parsers for different data sources will

be developed for the pilot

SWC

15 Cf httpsd3jsorg

D54 ndash v 100

Page

36

GraphSearch GUI To be configured for the pilot SWC

Table 12 Components needed to deploy the Second SC6 Pilot

8 Second SC7 Pilot Deployment

81 Use cases

The pilot is carried out by SatCen UoA and NCSR-D in the frame of SC7 Secure societies ndash

Protecting freedom and security of Europe and its citizens

The pilot demonstrates the following workflows

1 Event detection workflow News sites and social media are monitored and processed

in order to extract and localize information about events Events are categorized and

the information from them is extracted the end-user is notified about the area interested

by the news and can visualize the events information together with the changes

detected by the other workflow (if activated)

2 Change detection workflow The end user selects a relevant Area of Interest With

respect to the selected dates two satellite images (earliest and latest) of these areas

are downloaded from ESA Sentinels Scientific Data Hub and processed in order to

detect changes The end-user is notified about detected changes and can view the

images and event information about this area

The second cycle of the SC7 pilot will extend the functionality and improve the performance of

the first cycle of the pilot (cf D52 Section 8)

Apart from the datasets used in the first cycle of the pilot this cycle will also use the keyword-

based Twitter API to retrieve tweets based on pre-defined keywords To further support the

keyword-based search the second cycle of the pilot will also include a full-text indexing engine

The following outputs are made available for visualization or further processing

Relevant news related to specific keywords together with the corresponding Area of

Interested

Detected changes

Moreover, the event detection workflow will be extended in order to automatically activate the change detection workflow. These changes are depicted in the updated architecture diagram in Figure 7.


8.2 Requirements

Table 13 lists the ingestion, storage, processing and output requirements set by the second cycle of the pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 13 lists only the new requirements.

Table 13 Requirements of the Second SC7 Pilot

Requirement | Comment
R1: Monitor keyword-based text services (Twitter). Text is retrieved and stored together with provenance and any metadata provided by the service (notably location). | The NOMAD data connectors to Twitter and Reuters will be adapted to access the keyword search API of Twitter and store to Cassandra.
R2: Regularly execute event detection using Spark over the most recent text batch. | Event detection is part of the ingestion process and adds annotations to the text data, not part of the distributed processing.
R3: Improve the speed of the change detection workflow. | Optimize the scalability of the operators developed in Apache Spark for the change detection workflow.
R4: Extend change detection workflow to improve accuracy. | Fundamental SNAP operators (e.g. Subset and Terrain Correction) for Sentinel 1 will be adapted to Apache Spark (a sketch of this adaptation is given after this table).
R5: Areas of Interest are automatically defined by event detection. | The Sentinel data connector is parametrized from the event detection module with a GIS shape.
R6: End-user interface is based on Sextant. | Improvement of Sextant functionalities to improve the user experience.
R7: Users must be authenticated and authorized to access the pilot data. | Sextant will be extended in order to support authentication and authorization.
Table 13 Requirements of the Second SC7 Pilot
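As a rough sketch of what the adaptation named in R4 could involve, the SNAP Subset and Terrain-Correction operators can be driven from a Spark job as below. This assumes the ESA SNAP Python bindings (snappy) are installed on every Spark worker; the product paths, area-of-interest polygon and operator parameters are hypothetical placeholders.

from pyspark import SparkContext

def subset_and_correct(product_path):
    # Runs on the worker: read a Sentinel-1 product, subset it to the AOI, terrain-correct it.
    from snappy import ProductIO, GPF, HashMap
    GPF.getDefaultInstance().getOperatorSpiRegistry().loadOperatorSpis()
    product = ProductIO.readProduct(product_path)
    params = HashMap()
    params.put('geoRegion', 'POLYGON ((22.9 40.6, 23.1 40.6, 23.1 40.7, 22.9 40.7, 22.9 40.6))')
    subset = GPF.createProduct('Subset', params, product)
    corrected = GPF.createProduct('Terrain-Correction', HashMap(), subset)
    output_path = product_path + '_tc'
    ProductIO.writeProduct(corrected, output_path, 'BEAM-DIMAP')
    return output_path

sc = SparkContext(appName='sc7-change-detection-preprocessing')
paths = ['/data/sentinel1/S1A_scene_early.zip', '/data/sentinel1/S1A_scene_late.zip']
print(sc.parallelize(paths).map(subset_and_correct).collect())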

Figure 7 Architecture of the Second SC7 Pilot

8.3 Architecture

To satisfy the requirements above, the following modules will be deployed:

Storage infrastructures

HDFS for storing satellite images

Cassandra for storing news and tweets content and metadata

Lucene for storing the GADM dataset, i.e. the administrative areas together with their geo-locations


Strabon for storing geo-locations of detected changes and location metadata about

news and tweets

Processing infrastructures

Spark will be made available for improving the change detection module and

developing the event detection module

Data integration

Semagrow will federate Strabon and Cassandra to provide the user interface with

homogeneous access to both data stores

Other modules

Twitter data connector

Reuters RSS feed reader

The Sentinel Data Aggregator receives as input the set of areas of interest and submits a suitable query to the Sentinels Scientific Data Hub (a query sketch is given after this list)

Sextant as the user interface
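As an indicative example of such a query, the aggregator could issue an OpenSearch request like the one below against the Scientific Data Hub. The account credentials, area-of-interest polygon and date range are hypothetical placeholders.

import requests

# Sketch: ask the Sentinels Scientific Data Hub for the two most recent Sentinel-1 products
# intersecting an area of interest. Credentials, AOI polygon and dates are placeholders.
AOI = 'POLYGON((22.90 40.60, 23.10 40.60, 23.10 40.70, 22.90 40.70, 22.90 40.60))'
query = ('platformname:Sentinel-1 AND footprint:"Intersects(%s)" '
         'AND beginposition:[2016-11-01T00:00:00.000Z TO 2016-12-31T23:59:59.999Z]' % AOI)

resp = requests.get('https://scihub.copernicus.eu/dhus/search',
                    params={'q': query, 'rows': 2, 'orderby': 'beginposition desc',
                            'format': 'json'},
                    auth=('username', 'password'))   # Data Hub account credentials
resp.raise_for_status()

entries = resp.json()['feed'].get('entry', [])
if isinstance(entries, dict):                        # a single match is returned as one object
    entries = [entries]
for entry in entries:
    print(entry['title'], entry['id'])               # product name and UUID for later download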

8.4 Deployment

Table 14 lists the components provided to the pilot as part of BDI16 and components that will be developed within WP6 in the context of executing the pilot.

Table 14 Components needed to deploy the Second SC7 Pilot

Module | Task | Responsible
Big Data Integrator: HDFS/Hadoop, Cassandra, Spark, Semagrow, Strabon, SOLR | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, UoA, SWC
Cassandra and Strabon stores | The schema needs to be altered to support tweets by keyword | NCSR-D and UoA
Change detection module | Spark code to be developed for extending and improving the change detection algorithm | UoA
Event Detection module | Spark code to be developed to scale the event detection algorithm | NCSR-D
Twitter data connector | To be extended to access the keyword search Twitter API | NCSR-D
User interface | To be enhanced for the pilot | UoA
Table 14 Components needed to deploy the Second SC7 Pilot

16 Cf. https://github.com/big-data-europe/README/wiki/Components


9 Conclusions

This report analysed the pilot requirements and specified the components of the generic Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting round. The relevant work in this task is to ensure that the components are within the scope of what is prepared in WP4 and that they interoperate and can be used in the same application.

All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure and provided to the piloting partners as a basis for their piloting applications, which will be developed in WP6. As a result of this preliminary testing and the interaction between the technical partners and the piloting partners, some of the original pilot descriptions have been refined and fully specified, and their usage of BDI components has been clarified. This ensures that the pilot descriptions are consistent with the first public release of the BDI platform (D4.2) and can be reproduced by interested third parties.

Work in this task (Task 5.2) will proceed as follows:

During the second pilot deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.

During the third pilot deployment phase, work in this task will prepare the next version of this document regarding the BDI instances needed for the third piloting round.

Page 17: Data Technologies · consulting about the platform. This task includes two phases: the design and the deployment phase. The design phase involves the following. Review the pilot descriptions

D54 ndash v 100

Page

17

Figure 2 Architecture of the Second SC2 Pilot

Figure 2 Architecture of the Second SC2 Pilot

33 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing publication full-text and ingested datasets

A graph database for storing publication metadata (terms and named entities)

affiliation metadata (connections between researchers) weather metadata and VITIS

metadata

Processing infrastructures

Metadata extraction Spark or UnifiedViews3 are used to extract RDF metadata from

publication full-text These tools will react on Kafka messages Spark and UnifiedViews

will be evaluated for this task

3 Cf httpwwwunifiedviewseu

D54 ndash v 100

Page

18

PoolParty A SKOS Thesaurus4 will be used to consolidatetranslate (linkmap) the

terms in the ingested documents (eg bio terms locations and other named entities)

For this step the SWC PoolParty Semantic Suite5 will be used Additional enrichment

of the dataset will be explored eg via linking to DBpedia or other LOD sources

AKSTEM the process of discovering relations and associations between organizations

and people in the field of viticulture research

Phenolic Modeling algorithm already developed in AK VITIS will be adapted to work in

the context of an Apache Spark application

Variety Identification already developed in AK VITIS will be adapted to work in the

context of an Apache Spark application

Extraction of images and figures and their captions from publication PDFs

Data analysis which writes analysis results back into the infrastructure to be retrieved

for visualization Data analysis should accompany each write-back with appropriate

metadata that specify the processing lineage of the derived dataset Intermediate

results should also be written out (and described as such in the metadata) in order to

allow resuming processing after a failure

Other modules

Flume for publication ingestion For every source that will be ingested into the system

there will be a flume agent responsible for data ingestion and basic

modificationunification

Kafka as soon as a new record is available a Kafka message will be produced One

kafka consumer stores raw data into HDFS

34 Deployment

Table 4 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 4 Components needed to deploy the Second SC2 Pilot

Module Task Responsible

Spark over HDFS Flume

Kafka

BDI dockers made available by WP4 FH TF InfAI

SWC

4 Cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 5 Cf httpwwwpoolpartybiz

D54 ndash v 100

Page

19

GraphDB andor Neo4j

dockerization

To be investigated if the Docker

images provided by the official

systems6 are suitable for the pilot If

not will be altered for the pilot or use

an already dockerized triple store such

as Virtuoso or 4store

SWC

Flume agents for publication

ingestion and processing

To be developed for the pilot SWC

Flume agents for data

ingestion

To be extended for the pilot in order to

support the introduced datasets

(accuweather data user-generated

data)

SWC AK

Data storage schema To be developed for the pilot SWC AK

Phenolic modelling To be adapted from AK VITIS for the

pilot

AK

Spark AKSTEM To be adapted from AK STEM for the

pilot

AK

Variety Identification To be adapted from AK VITIS for the

pilot

AK

Table 4 Components needed to deploy the Second SC2 Pilot

6 httpsneo4jcomdeveloperdocker

D54 ndash v 100

Page

20

4 Second SC3 Pilot Deployment

41 Overview

The pilot is carried out by CRES in the frame of SC3 Secure Clean and Efficient Energy

The second pilot cycle extends the first pilot by adding additional online and offline data

analysis on raw data regarding Acoustic Emissions (AE) sensors and aggregated data such

as parametrics from continuous monitoring systems (CMS) The pilot demonstrates the

following workflow a developer in the field of wind energy enhances condition monitoring for

each unit in a wind farm by pooling together data from multiple units from the same farm (to

consider the cluster operation in total) and third party data (to perform correlated assessment)

The custom analysis modules created by the developer use both raw data that are transferred

offline to the processing cluster and condensed data streamed online at the same time order

that the event occurs

The following datasets are involved

Raw sensor and SCADA data from a given wind farm

Online stream data comprised of parametrics and statistics extracted from the raw

SCADA data

Raw sensor data from Acoustic Emissions module from a given wind farm

All data is in custom binary or ASCII formats ASCII files contain a metadata header and in

tabulated form the signal data (signal in columns time sequence in rows) All data is annotated

by location time and system id

The following processing is carried out

Near-real time execution of parametrized models to return operational statistics

warnings including correlation analysis of data across units

Weekly execution of operational statistics

Weekly execution of model parametrization

Weekly specific acoustic emissions DSP

The following outputs are made available for visualization or further processing

Operational statistics near-real time and weekly

Model parameters

D54 ndash v 100

Page

21

42 Requirements

Table 5 lists the ingestion storage processing and output requirements set by this pilot Since

the second cycle of the pilot extends the first pilot some requirements are identical and

therefore omitted from Table 5

Table 5 Requirements of Second SC3 Pilot

Requirement Comment

R1 The online data will be sent (via

OPC) from the intermediate

(local) processing level to BDI

A data connector must be developed that provides

for receiving OPC streams from an OPC-

compatible server

R2 The application should be able

to recover from short outages by

collecting the data transmitted

during the outage from the data

sources

An OPC data connector must be developed that

can retrieve the missing data collected at the

intermediate level from the distributed data

historian systems

R3 Near-realtime execution of

parametrized models to return

operational statistics including

correlation analysis of data

across units

The analysis software should write its results back

into a specified format and data model that is

appropriate input for further analysis

R4 The GUI supports database

querying and data visualization

for the analytics results

The GUI will be able to access files in the format

and data model

Table 5 Requirements of the Second SC3 Pilot

D54 ndash v 100

Page

22

Figure 3 Architecture of the Second SC3 Pilot

Figure 3 Architecture of the Second SC3 Pilot

43 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS that stores binary blobs each holding a temporal slice of the complete data The

slicing parameters are fixed and can be applied at data ingestion time

A Postgres relational database to store the warnings operational statistics and the

output of the analysis The schema will be defined in a later

A Kafka broker that will distribute the continuous stream of CMS to model execution

Processing infrastructures

D54 ndash v 100

Page

23

A processor that operates upon temporal slices of data

A Spark module that orchestrates the application of the processor on slices

A Spark streaming module that operates on the online data

Other modules

A data connector that offers an ingestion endpoint andor can retrieve from remote data

sources using the FTP protocol

A data connector that offers an ingestion endpoint that can retrieve an online stream

using OPC protocol and publish it to a Kafka topic

Data visualization that can visualize the data files stored in HDFS

44 Deployment

Table 6 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 6 Components needed to deploy the Second SC3 Pilot

Module Task Responsible

Spark HDFS Postgres

Kafka

BDI dockers made available by WP4 FH TF InfAI

NCSR-D SWC

Acoustic Emissions DSP To be developed for the pilot CRES

OPC Data connector To be developed for the pilot CRES

Data visualization To be extended for the pilot CRES

Table 6 Components needed to deploy the Second SC3 Pilot

D54 ndash v 100

Page

24

5 Second SC4 Pilot Deployment

51 Use cases

The pilot is carried out by FhG and CERTH in the frame of SC4 Smart Green and Integrated

Transport

The pilot demonstrates how to implement the workflow for ingesting processing and storing

stream and historical traffic data in a distributed environment The pilot demonstrates the

following workflows

The map matching of the Floating Car Data (FCD) stream that is generated by the taxi

fleet The FCD data that represents the position of cabs using latitude and longitude

coordinates must be map matched to the roads on which the cabs are driving in order

to infer the traffic conditions of the roads The map matching is done through an

algorithm using a geographical database and topological rules

The monitoring of the current traffic conditions that consumes the mapped FCD data

and infers the traffic conditions of the roads

The forecasting of future traffic conditions based on a model that is trained from

historical and real-time mapped FCD data

The second pilot is based upon the processing modules developed in the first pilot (cf D52

Section 5) namely the processing modules developed by CERTH to analyze traffic data

classify traffic conditions The second pilot will also develop the newly added workflow of the

traffic forecasting and model training that did not exist during the first pilot cycle

The data sources available for the pilot are

A near-real time stream Floating Car Data (FCD) generated by a fleet of 1200 taxis

containing information about the position speed and direction of the cabs

A historical database of recorded FCD data

A geographical database with information about the road network in Thessaloniki

The results of traffic monitoring and traffic forecasting are saved into a database for querying

statistics and visualizations

52 Requirements

Table 7 lists the ingestion storage processing and output requirements set by this pilot Since

the present pilot cycle is an extension of the first pilot the requirements of the first pilot also

apply Table 13 lists only the new requirements

D54 ndash v 100

Page

25

Table 7 Requirements of the Second SC4 Pilot

Requirement Comment

R1 The pilot will enable the

evaluation of the present and

future traffic conditions (eg

congestion) within temporal

windows

The FCD map matched data are used to determine

the current traffic condition and to make predictions

within different time windows

R2 The traffic predictions will be

saved in a database

Traffic condition and prediction will be used for

queries statistics evaluation of the quality of

predictions visualizations

R3 The pilot can be started in two

configurations single node (for

development and testing) and

cluster (production)

It must be possible to run all the pilot components

in one single node for development and testing

purposes The cluster configuration must provide

cluster of any components messaging system

(Kafka) processing modules (Flink Spark

TensorFlow) storage (Postgres)

Table 7 Requirements of the Second SC4 Pilot

D54 ndash v 100

Page

26

Figure 4 Architecture of the Second SC4 Pilot

Figure 4 Architecture of the Second SC4 Pilot

53 Architecture

The architecture of the pilot has been designed taking into consideration the data sources

mostly streams the processing steps needed and the information that needs to be computed

The pilot will ingest data from a near real-time FCD data stream from cabs and from historical

FCD data The FCD data needs to be preprocessed for map matching before being used for

classificationprediction

Apache Kafka will be used to distribute the computations as it provides a scalable fault

tolerant messaging system The processing of the data streams will be performed within

temporal windows Apache Flink will be used for the map matching algorithm in the same

manner as in the first cycle of the pilot Apache Spark or Tensorflow will be considered as a

platform to implement the traffic forecasting algorithm

The algorithms used for the map matching and classification will be provided using R as

it provides a good support for machine learning algorithms and because it is commonly used

D54 ndash v 100

Page

27

and well known by researchers at CERTH In order to use the R packages in a Flink application

developed in Java the pilot will connect to R server (via Rserve) Recurrent Neural Networks

will be used for the traffic forecasting module

The traffic conditions and prediction computation will be stored in a scalable fault tolerant

database such as Elasticsearch The storage system must support spatial and temporal

indexing

54 Deployment

Table 8 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 8 Components needed to deploy Second SC4 Pilot

Module Task Responsible

PostGIS Elasticsearch

Kafka Flink Spark

TensorFlow

BDI dockers made available by WP4 NCSR-D SWC

TF FhG

A Kafka producer for FCD

data stream (source URL)

and historical data (source

file system)

Develop a Kafka producer to collect

the FCD data as a stream from web

services and from the file system for

the historical data sets and send them

to a Kafka topic

FhG

Kafka brokers Install Kafka to provide a message

broker and the topics

SWC

A Spark application for traffic

forecasting and model

training

Develop a Spark application that

consumes FCD matched data from a

Kafka topic The application will train a

prediction model and write the traffic

predictions to ElasticSearch

FhG

D54 ndash v 100

Page

28

A Kafka consumer for storing

analysis results

Develop a Kafka consumer that stores

the result of the Traffic Classification

and prediction module

FhG

Table 8 Components needed to deploy the Second SC4 Pilot

6 Second SC5 Pilot Deployment

61 Use cases

The pilot is carried out by NCSR-D in the frame of SC5 Climate Action Environment Resource

Efficiency and Raw Materials

The pilot demonstrates the following workflow A (potentially hazardous) substance is released

in the atmosphere that results to increased readings in one or more monitoring stations The

user accesses a user interface provided by the pilot to define the locations of the monitoring

stations as well as a timeseries of the measured values (eg gamma dose rate) The platform

initiates

a weather matching algorithm that is a search for similarity of the current weather and

the pre-computed weather patterns as well as

a dispersion matching algorithm that is a search for similarity of the current substance

dispersion patterns with the precomputed ones

The weather patterns have been extracted in a pre-processing step by clustering weather

conditions recorded in the past while the substance dispersion patterns have been

precomputed by simulating different scenarios of substance release and weather conditions

The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request

The following datasets are involved

NetCDF files from the European Centre for Medium range Weather Forecasting

(ECMWF7)

GRIB files from National Oceanic and Atmospheric Administration (NOAA8)

The following processing will be carried out

The weather clustering algorithm that creates clusters of similar weather conditions

implemented using the BDI platform (see Section 63)

7 httpappsecmwfintdatasets 8 httpswwwncdcnoaagovdata-accessmodel-datamodel-datasetsglobal-forcast-system-gfs

D54 ndash v 100

Page

29

The WRF downscaling that takes as input a low resolution weather and creates a high

resolution weather

The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model

computes dispersion patterns given predominant weather conditions

The following outputs are made available for visualization or further processing

The dispersions produced by DIPCOT

The weather clusters produced by the weather clustering algorithm

62 Requirements

Table 9 lists the ingestion storage processing and output requirements set by this pilot

Table 9 Requirements of Second SC5 Pilot

Requirement Comment

R1 Provide a means of downloading

currentevaluation weather from

ECMWF or alternative services

Data connectorinterface needs to be developed

R2 ECMWF and NOAA datasets are

compatible with the WRF and

DIPCOT naming conventions

A preprocessing WPS normalization step will

perform the necessary transformations and

variable renamings needs to ensure compatibility

R3 Retrieve NetCDF files from HDFS

as input to the weather clustering

algorithm

R4 Dispersion matching will filter on

dispersion values

Relational database will provide indexes on

dispersion values for efficient dispersion search

R5 Dispersion visualization Weather and dispersion matching must produce

output compatible with Sextantrsquos input or Sextant

must be modified to support new input

Table 9 Requirements of the Second SC5 Pilot

D54 ndash v 100

Page

30

Figure 5 Architecture of the Second SC5 Pilot

Figure 5 Architecture of the Second SC5 Pilot

63 Architecture

To satisfy the requirements described above the following components will be deployed

Storage infrastructure

HDFS for storing NetCDF and GRIB files

Postgres for storing dispersions

Processing components

Scilearn-kit or TensorFlow to host the weather clustering algorithm

Other modules

ECMWF and NOAA data connectors

WPS normalization procedure

WRF downscaling component

DIPCOT atmospheric dispersion model

Weather and dispersion matching

Sextant for visualizing the dispersion layer

D54 ndash v 100

Page

31

64 Deployment

Table 10 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 10 Components needed to deploy the Second SC5 Pilot

Module Task Responsible

HDFS Sextant Postgres BDI dockers made available by WP4 TF UoA NCSR-D

Scikit-learn TensorFlow To be developed in the pilot NCSR-D

DIPCOT To be packaged in the pilot NCSR-D

Weather clustering algorithm To be developed in the pilot NCSR-D

Weather matching To be developed in the pilot NCSR-D

Dispersion matching To be developed in the pilot NCSR-D

ECMWF and NOAA data

connector

To be developed in the pilot NCSR-D

Data visualization UI To be developed in the pilot NCSR-D

Table 10 Components needed to deploy the Second SC5 Pilot

D54 ndash v 100

Page

32

7 Second SC6 Pilot Deployment

71 Use cases

The pilot is carried out by NCSR-D and SWC in the frame of SC6 Europe in a changing world

- inclusive innovative and reflective societies

The pilot demonstrates the following workflow Municipality economic data (ie budget and

budget execution data) are ingested at a regular basis (daily weekly and so on) from a series

of locations in a variety of structures and formats are homogenized so that they can be

compared analyzed and visualized in a comprehensible way The data is exposed to users

via a dashboard that exposes searchdiscovery aggregation analysis correlation and

visualization functionalities over structured data The results of the data analysis will be stored

in the infrastructure to avoid carrying out the same processing multiple times

The second cycle of the pilot will extend the first pilot by incorporating different formats by

developing a modular parsing library

The following datasets are involved

Budget execution data of Municipality of Athens

Budget execution data of Municipality of Thessaloniki

Budget execution data of Municipality of Barcelona

The current datasets involved are exposed either as an API or as CSV XML files

Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies

Statistical data will be described in the RDF DataCube12 vocabulary

The following processing is carried out

Data ingestion and homogenization

Aggregation analysis correlation over scientific data

The following outputs are made available for visualization or further processing

Structured information extracted from budget datasets exposed as a SPARQL endpoint

Metadata for dataset searching and discovery

9 Cf httpsjoinupeceuropaeuassetdcat_application_profiledescription 10 Cf httpwwwomgorgspecEDMC-FIBOFND10Beta1indexhtm 11 Cf httpwwwomgorghot-topicsfinancehtm 12 Cf httpswwww3orgTR2014REC-vocab-data-cube-20140116

D54 ndash v 100

Page

33

Aggregation and analysis

72 Requirements

Table 11 lists the ingestion storage processing and output requirements set by this pilot

Table 11 Requirements of the Second SC6 Pilot

Requirement Comment

R1 In case of failures during data

processing users should be able to

recover data produced earlier in the

workflow along with their lineage and

associated metadata

Processing modules should periodically

store intermediate results When starting

up processing modules should check at the

metadata registry if intermediate results are

available

R2 Transform budget data into a

homogenized format using various

parsers

Parsers will be developed for the pilot

taking into account R1

R3 Expose data and metadata through a

SPARQL endpoint

The triple store should be accessed via a

SPARQL endpoint

R4 Intuitive easy-to-use interface for

searching and selecting relevant data

sources The use of the user interface

should be documented so that users

can ease into using it with as little

effort as possible

The GraphSearch UI will be used to create

visualizations from SPARQL queries

Table 11 Requirements of the Second SC6 Pilot

D54 ndash v 100

Page

34

Figure 6 Architecture of the Second SC6 Pilot

Figure 6 Architecture of the Second SC6 Pilot

73 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing ingested datasets

4store for storing homogenized statistical data and dataset metadata

Processing infrastructures

Metadata extraction Spark is used to extract RDF data and metadata from budget

data These tools will react on Kafka messages

PoolParty A SKOS Thesaurus13 will be used to consolidatetranslate (linkmap) the

terms in the ingested documents (eg bio terms locations and other named entities)

For this step the SWC PoolParty Semantic Suite14 will be used as an external service

13 Please cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 14 Please cf httpwwwpoolpartybiz

D54 ndash v 100

Page

35

PoolParty is accessible from the BDE components via an HTTP API The connection

between Spark and PoolParty has been implemented in the first pilot cycle Additional

enrichment of the dataset will be explored eg via linking to DBpedia or other LOD

sources

Data analysis that will be performed on demand by pre-defined queries in the

dashboard

Other modules

Flume for dataset ingestion For every source that will be ingested into the system there

will be a flume agent responsible for data ingestion and basic modificationunification

Kafka as soon as a new record is available a Kafka message will be produced One

kafka consumer stores raw data into HDFS

A set of pre-defined SPARQL queries that carry out analytical aggregations important

comparisons and or other analysis of the data

GUI that provide functionality for (a) metadata searching to discover datasets data and

publications (b) linked data browsing (ie dereferencing entity descriptions in RDF) in

the form of a visual dashboard realised in d3js15

GraphSearch as the user interface

74 Deployment

Table 12 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 12 Components needed to deploy the Second SC6 Pilot

Module Task Responsible

Spark over HDFS 4store

Flume Kafka

BDI dockers made available by WP4 FH TF InfAI

NCSR-D SWC

Data storage schema To be extended for the pilot SWC

Metadata extraction Parsers for different data sources will

be developed for the pilot

SWC

15 Cf httpsd3jsorg

D54 ndash v 100

Page

36

GraphSearch GUI To be configured for the pilot SWC

Table 12 Components needed to deploy the Second SC6 Pilot

8 Second SC7 Pilot Deployment

81 Use cases

The pilot is carried out by SatCen UoA and NCSR-D in the frame of SC7 Secure societies ndash

Protecting freedom and security of Europe and its citizens

The pilot demonstrates the following workflows

1 Event detection workflow News sites and social media are monitored and processed

in order to extract and localize information about events Events are categorized and

the information from them is extracted the end-user is notified about the area interested

by the news and can visualize the events information together with the changes

detected by the other workflow (if activated)

2 Change detection workflow The end user selects a relevant Area of Interest With

respect to the selected dates two satellite images (earliest and latest) of these areas

are downloaded from ESA Sentinels Scientific Data Hub and processed in order to

detect changes The end-user is notified about detected changes and can view the

images and event information about this area

The second cycle of the SC7 pilot will extend the functionality and improve the performance of

the first cycle of the pilot (cf D52 Section 8)

Apart from the datasets used in the first cycle of the pilot this cycle will also use the keyword-

based Twitter API to retrieve tweets based on pre-defined keywords To further support the

keyword-based search the second cycle of the pilot will also include a full-text indexing engine

The following outputs are made available for visualization or further processing

Relevant news related to specific keywords together with the corresponding Area of

Interested

Detected changes

Moreover the event detection workflow will be extended in order to automatically activate the

change detection workflow These changes are depicted in the updated architecture diagram

in Figure 7

D54 ndash v 100

Page

37

82 Requirements

Table 13 lists the ingestion storage processing and output requirements set by the second

cycle of the pilot Since the present pilot cycle is an extension of the first pilot the requirements

of the first pilot also apply Table 13 lists only the new requirements

Table 13 Requirements of the Second SC7 Pilot

Requirement Comment

R1 Monitor keyword-based text services

(Twitter) Text is retrieved and stored

together with provenance and any

metadata provided by the service

(notably location)

The NOMAD data connectors to Twitter

and Reuters will be adapted to access the

keyword search API of Twitter and store to

Cassandra

R2 Regularly execute event detection

using Spark over the most recent text

batch

Event detection is part of the ingestion

process and adds annotations to the text

data not part of the distributed processing

R3 Improve the speed of the change

detection workflow

Optimize the scalability of the operators

developed in Apache Spark for the change

detection workflow

R4 Extend change detection workflow to

improve accuracy

Fundamental SNAP operators (eg Subset

and Terrain Correction) for Sentinel 1 will be

adapted to Apache Spark

R5 Areas of Interest are automatically

defined by event detection

The Sentinel data connector is

parametrized from the event detection

module with a GIS shape

R6 End-user interface is based on Sextant Improvement of Sextant functionalities to

improve the user experience

D54 ndash v 100

Page

38

R7 Users must be authenticated and

authorized to access the pilot data

Sextant will be extended in order to support

authentication and authorization

Table 13 Requirements of the Second SC7 Pilot

Figure 7 Architecture of the Second SC7 Pilot

Figure 7 Architecture of the Second SC7 Pilot

83 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing satellite images

Cassandra for storing news and tweets content and metadata

Lucene for storing GADM dataset ie the administrative areas together with their geo-

locations

D54 ndash v 100

Page

39

Strabon for storing geo-locations of detected changes and location metadata about

news and tweets

Processing infrastructures

Spark will be made available for improving the change detection module and

developing the event detection module

Data integration

Semagrow will federate Strabon and Cassandra to provide the user interface with

homogeneous access to both data stores

Other modules

Twitter data connector

Reuters RSS feed reader

The Sentinel Data Aggregator receives as input the set of areas of interest and submits

a suitable query to the Sentinels Scientific Data Hub

Sextant as the user interface

84 Deployment

Table 14 lists the components provided to the pilot as part of BDI16 and components that will

be developed within WP6 in the context of executing the pilot

Table 14 Components needed to deploy the Second SC7 Pilot

Module Task Responsible

Big Data Integrator

HDFSHadoop Cassandra

Spark Semagrow Strabon

SOLR

BDI dockers made available by WP4 FH TF InfAI

NCSR-D UoA

SwC

Cassandra and Strabon

stores

The schema needs to be altered to

support tweets by keyword

NCSR-D and

UoA

Change detection module Spark code to be developed for UoA

16 Cf httpsgithubcombig-data-europeREADMEwikiComponents

D54 ndash v 100

Page

40

extending and improving the change

detection algorithm

Event Detection module Spark code to be developed to scale

the event detection algorithm

NCSR-D

Twitter data connector To be extended to access the keyword

search Twitter API

NCSR-D

User interface To be enhanced for the pilot UoA

Table 14 Components needed to deploy the Second SC7 Pilot

D54 ndash v 100

Page

41

9 Conclusions This report analysed the pilot requirements and specifies the components of the the generic

Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting

round The relevant work in this task is to ensure that the components are within the scope

of what is prepared in WP4 and that they interoperate and can be used in the same

application

All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure

and provided to the piloting partners as a basis for their piloting applications which will be

developed in WP6 As a result of this preliminary testing and the interaction between the

technical partners and the piloting partners some of the original pilot descriptions have

been refined and fully specified and their usage of BDI components has been clarified This

ensures that the pilot descriptions are consistent with the first public release of the BDI

platform (D42) and can be reproduced by interested third parties

Work in this task (Task 52) will proceed as follows

During the second pilot deployment phase work in this task will follow and document

development of the individual components and test their integration into the platform

During the third pilot deployment phase work in this task will prepare the next version

of this document regarding the BDI instances needed for the third piloting round

Page 18: Data Technologies · consulting about the platform. This task includes two phases: the design and the deployment phase. The design phase involves the following. Review the pilot descriptions

D54 ndash v 100

Page

18

PoolParty A SKOS Thesaurus4 will be used to consolidatetranslate (linkmap) the

terms in the ingested documents (eg bio terms locations and other named entities)

For this step the SWC PoolParty Semantic Suite5 will be used Additional enrichment

of the dataset will be explored eg via linking to DBpedia or other LOD sources

AKSTEM the process of discovering relations and associations between organizations

and people in the field of viticulture research

Phenolic Modeling algorithm already developed in AK VITIS will be adapted to work in

the context of an Apache Spark application

Variety Identification already developed in AK VITIS will be adapted to work in the

context of an Apache Spark application

Extraction of images and figures and their captions from publication PDFs

Data analysis which writes analysis results back into the infrastructure to be retrieved

for visualization Data analysis should accompany each write-back with appropriate

metadata that specify the processing lineage of the derived dataset Intermediate

results should also be written out (and described as such in the metadata) in order to

allow resuming processing after a failure

Other modules

Flume for publication ingestion For every source that will be ingested into the system

there will be a flume agent responsible for data ingestion and basic

modificationunification

Kafka as soon as a new record is available a Kafka message will be produced One

kafka consumer stores raw data into HDFS

34 Deployment

Table 4 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 4 Components needed to deploy the Second SC2 Pilot

Module Task Responsible

Spark over HDFS Flume

Kafka

BDI dockers made available by WP4 FH TF InfAI

SWC

4 Cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 5 Cf httpwwwpoolpartybiz

D54 ndash v 100

Page

19

GraphDB andor Neo4j

dockerization

To be investigated if the Docker

images provided by the official

systems6 are suitable for the pilot If

not will be altered for the pilot or use

an already dockerized triple store such

as Virtuoso or 4store

SWC

Flume agents for publication

ingestion and processing

To be developed for the pilot SWC

Flume agents for data

ingestion

To be extended for the pilot in order to

support the introduced datasets

(accuweather data user-generated

data)

SWC AK

Data storage schema To be developed for the pilot SWC AK

Phenolic modelling To be adapted from AK VITIS for the

pilot

AK

Spark AKSTEM To be adapted from AK STEM for the

pilot

AK

Variety Identification To be adapted from AK VITIS for the

pilot

AK

Table 4 Components needed to deploy the Second SC2 Pilot

6 httpsneo4jcomdeveloperdocker

D54 ndash v 100

Page

20

4 Second SC3 Pilot Deployment

41 Overview

The pilot is carried out by CRES in the frame of SC3 Secure Clean and Efficient Energy

The second pilot cycle extends the first pilot by adding additional online and offline data

analysis on raw data regarding Acoustic Emissions (AE) sensors and aggregated data such

as parametrics from continuous monitoring systems (CMS) The pilot demonstrates the

following workflow a developer in the field of wind energy enhances condition monitoring for

each unit in a wind farm by pooling together data from multiple units from the same farm (to

consider the cluster operation in total) and third party data (to perform correlated assessment)

The custom analysis modules created by the developer use both raw data that are transferred

offline to the processing cluster and condensed data streamed online at the same time order

that the event occurs

The following datasets are involved

Raw sensor and SCADA data from a given wind farm

Online stream data comprised of parametrics and statistics extracted from the raw

SCADA data

Raw sensor data from Acoustic Emissions module from a given wind farm

All data is in custom binary or ASCII formats ASCII files contain a metadata header and in

tabulated form the signal data (signal in columns time sequence in rows) All data is annotated

by location time and system id

The following processing is carried out

Near-real time execution of parametrized models to return operational statistics

warnings including correlation analysis of data across units

Weekly execution of operational statistics

Weekly execution of model parametrization

Weekly specific acoustic emissions DSP

The following outputs are made available for visualization or further processing

Operational statistics near-real time and weekly

Model parameters

D54 ndash v 100

Page

21

42 Requirements

Table 5 lists the ingestion storage processing and output requirements set by this pilot Since

the second cycle of the pilot extends the first pilot some requirements are identical and

therefore omitted from Table 5

Table 5 Requirements of Second SC3 Pilot

Requirement Comment

R1 The online data will be sent (via

OPC) from the intermediate

(local) processing level to BDI

A data connector must be developed that provides

for receiving OPC streams from an OPC-

compatible server

R2 The application should be able

to recover from short outages by

collecting the data transmitted

during the outage from the data

sources

An OPC data connector must be developed that

can retrieve the missing data collected at the

intermediate level from the distributed data

historian systems

R3 Near-realtime execution of

parametrized models to return

operational statistics including

correlation analysis of data

across units

The analysis software should write its results back

into a specified format and data model that is

appropriate input for further analysis

R4 The GUI supports database

querying and data visualization

for the analytics results

The GUI will be able to access files in the format

and data model

Table 5 Requirements of the Second SC3 Pilot

D54 ndash v 100

Page

22

Figure 3 Architecture of the Second SC3 Pilot

Figure 3 Architecture of the Second SC3 Pilot

43 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS that stores binary blobs each holding a temporal slice of the complete data The

slicing parameters are fixed and can be applied at data ingestion time

A Postgres relational database to store the warnings operational statistics and the

output of the analysis The schema will be defined in a later

A Kafka broker that will distribute the continuous stream of CMS to model execution

Processing infrastructures

D54 ndash v 100

Page

23

A processor that operates upon temporal slices of data

A Spark module that orchestrates the application of the processor on slices

A Spark streaming module that operates on the online data

Other modules

A data connector that offers an ingestion endpoint andor can retrieve from remote data

sources using the FTP protocol

A data connector that offers an ingestion endpoint that can retrieve an online stream

using OPC protocol and publish it to a Kafka topic

Data visualization that can visualize the data files stored in HDFS

44 Deployment

Table 6 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 6 Components needed to deploy the Second SC3 Pilot

Module Task Responsible

Spark HDFS Postgres

Kafka

BDI dockers made available by WP4 FH TF InfAI

NCSR-D SWC

Acoustic Emissions DSP To be developed for the pilot CRES

OPC Data connector To be developed for the pilot CRES

Data visualization To be extended for the pilot CRES

Table 6 Components needed to deploy the Second SC3 Pilot

D54 ndash v 100

Page

24

5 Second SC4 Pilot Deployment

51 Use cases

The pilot is carried out by FhG and CERTH in the frame of SC4 Smart Green and Integrated

Transport

The pilot demonstrates how to implement the workflow for ingesting processing and storing

stream and historical traffic data in a distributed environment The pilot demonstrates the

following workflows

The map matching of the Floating Car Data (FCD) stream that is generated by the taxi

fleet The FCD data that represents the position of cabs using latitude and longitude

coordinates must be map matched to the roads on which the cabs are driving in order

to infer the traffic conditions of the roads The map matching is done through an

algorithm using a geographical database and topological rules

The monitoring of the current traffic conditions that consumes the mapped FCD data

and infers the traffic conditions of the roads

The forecasting of future traffic conditions based on a model that is trained from

historical and real-time mapped FCD data

The second pilot is based upon the processing modules developed in the first pilot (cf D52

Section 5) namely the processing modules developed by CERTH to analyze traffic data

classify traffic conditions The second pilot will also develop the newly added workflow of the

traffic forecasting and model training that did not exist during the first pilot cycle

The data sources available for the pilot are

A near-real time stream Floating Car Data (FCD) generated by a fleet of 1200 taxis

containing information about the position speed and direction of the cabs

A historical database of recorded FCD data

A geographical database with information about the road network in Thessaloniki

The results of traffic monitoring and traffic forecasting are saved into a database for querying

statistics and visualizations

52 Requirements

Table 7 lists the ingestion storage processing and output requirements set by this pilot Since

the present pilot cycle is an extension of the first pilot the requirements of the first pilot also

apply Table 13 lists only the new requirements

D54 ndash v 100

Page

25

Table 7 Requirements of the Second SC4 Pilot

Requirement Comment

R1 The pilot will enable the

evaluation of the present and

future traffic conditions (eg

congestion) within temporal

windows

The FCD map matched data are used to determine

the current traffic condition and to make predictions

within different time windows

R2 The traffic predictions will be

saved in a database

Traffic condition and prediction will be used for

queries statistics evaluation of the quality of

predictions visualizations

R3 The pilot can be started in two

configurations single node (for

development and testing) and

cluster (production)

It must be possible to run all the pilot components

in one single node for development and testing

purposes The cluster configuration must provide

cluster of any components messaging system

(Kafka) processing modules (Flink Spark

TensorFlow) storage (Postgres)

Table 7 Requirements of the Second SC4 Pilot

D54 ndash v 100

Page

26

Figure 4 Architecture of the Second SC4 Pilot

Figure 4 Architecture of the Second SC4 Pilot

53 Architecture

The architecture of the pilot has been designed taking into consideration the data sources

mostly streams the processing steps needed and the information that needs to be computed

The pilot will ingest data from a near real-time FCD data stream from cabs and from historical

FCD data The FCD data needs to be preprocessed for map matching before being used for

classificationprediction

Apache Kafka will be used to distribute the computations as it provides a scalable fault

tolerant messaging system The processing of the data streams will be performed within

temporal windows Apache Flink will be used for the map matching algorithm in the same

manner as in the first cycle of the pilot Apache Spark or Tensorflow will be considered as a

platform to implement the traffic forecasting algorithm

The algorithms used for the map matching and classification will be provided using R as

it provides a good support for machine learning algorithms and because it is commonly used

D54 ndash v 100

Page

27

and well known by researchers at CERTH In order to use the R packages in a Flink application

developed in Java the pilot will connect to R server (via Rserve) Recurrent Neural Networks

will be used for the traffic forecasting module

The traffic conditions and prediction computation will be stored in a scalable fault tolerant

database such as Elasticsearch The storage system must support spatial and temporal

indexing

54 Deployment

Table 8 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 8 Components needed to deploy Second SC4 Pilot

Module Task Responsible

PostGIS Elasticsearch

Kafka Flink Spark

TensorFlow

BDI dockers made available by WP4 NCSR-D SWC

TF FhG

A Kafka producer for FCD

data stream (source URL)

and historical data (source

file system)

Develop a Kafka producer to collect

the FCD data as a stream from web

services and from the file system for

the historical data sets and send them

to a Kafka topic

FhG

Kafka brokers Install Kafka to provide a message

broker and the topics

SWC

A Spark application for traffic

forecasting and model

training

Develop a Spark application that

consumes FCD matched data from a

Kafka topic The application will train a

prediction model and write the traffic

predictions to ElasticSearch

FhG

D54 ndash v 100

Page

28

A Kafka consumer for storing

analysis results

Develop a Kafka consumer that stores

the result of the Traffic Classification

and prediction module

FhG

Table 8 Components needed to deploy the Second SC4 Pilot

6 Second SC5 Pilot Deployment

61 Use cases

The pilot is carried out by NCSR-D in the frame of SC5 Climate Action Environment Resource

Efficiency and Raw Materials

The pilot demonstrates the following workflow A (potentially hazardous) substance is released

in the atmosphere that results to increased readings in one or more monitoring stations The

user accesses a user interface provided by the pilot to define the locations of the monitoring

stations as well as a timeseries of the measured values (eg gamma dose rate) The platform

initiates

a weather matching algorithm that is a search for similarity of the current weather and

the pre-computed weather patterns as well as

a dispersion matching algorithm that is a search for similarity of the current substance

dispersion patterns with the precomputed ones

The weather patterns have been extracted in a pre-processing step by clustering weather

conditions recorded in the past while the substance dispersion patterns have been

precomputed by simulating different scenarios of substance release and weather conditions

The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request

The following datasets are involved

NetCDF files from the European Centre for Medium range Weather Forecasting

(ECMWF7)

GRIB files from National Oceanic and Atmospheric Administration (NOAA8)

The following processing will be carried out

The weather clustering algorithm that creates clusters of similar weather conditions

implemented using the BDI platform (see Section 63)

7 httpappsecmwfintdatasets 8 httpswwwncdcnoaagovdata-accessmodel-datamodel-datasetsglobal-forcast-system-gfs

D54 ndash v 100

Page

29

The WRF downscaling that takes as input a low resolution weather and creates a high

resolution weather

The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model

computes dispersion patterns given predominant weather conditions

The following outputs are made available for visualization or further processing

The dispersions produced by DIPCOT

The weather clusters produced by the weather clustering algorithm

62 Requirements

Table 9 lists the ingestion storage processing and output requirements set by this pilot

Table 9 Requirements of Second SC5 Pilot

Requirement Comment

R1 Provide a means of downloading

currentevaluation weather from

ECMWF or alternative services

Data connectorinterface needs to be developed

R2 ECMWF and NOAA datasets are

compatible with the WRF and

DIPCOT naming conventions

A preprocessing WPS normalization step will

perform the necessary transformations and

variable renamings needs to ensure compatibility

R3 Retrieve NetCDF files from HDFS

as input to the weather clustering

algorithm

R4 Dispersion matching will filter on

dispersion values

Relational database will provide indexes on

dispersion values for efficient dispersion search

R5 Dispersion visualization Weather and dispersion matching must produce

output compatible with Sextantrsquos input or Sextant

must be modified to support new input

Table 9 Requirements of the Second SC5 Pilot

D54 ndash v 100

Page

30

Figure 5 Architecture of the Second SC5 Pilot

Figure 5 Architecture of the Second SC5 Pilot

63 Architecture

To satisfy the requirements described above the following components will be deployed

Storage infrastructure

HDFS for storing NetCDF and GRIB files

Postgres for storing dispersions

Processing components

Scilearn-kit or TensorFlow to host the weather clustering algorithm

Other modules

ECMWF and NOAA data connectors

WPS normalization procedure

WRF downscaling component

DIPCOT atmospheric dispersion model

Weather and dispersion matching

Sextant for visualizing the dispersion layer

D54 ndash v 100

Page

31

64 Deployment

Table 10 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 10 Components needed to deploy the Second SC5 Pilot

Module Task Responsible

HDFS Sextant Postgres BDI dockers made available by WP4 TF UoA NCSR-D

Scikit-learn TensorFlow To be developed in the pilot NCSR-D

DIPCOT To be packaged in the pilot NCSR-D

Weather clustering algorithm To be developed in the pilot NCSR-D

Weather matching To be developed in the pilot NCSR-D

Dispersion matching To be developed in the pilot NCSR-D

ECMWF and NOAA data

connector

To be developed in the pilot NCSR-D

Data visualization UI To be developed in the pilot NCSR-D

Table 10 Components needed to deploy the Second SC5 Pilot

D54 ndash v 100

Page

32

7 Second SC6 Pilot Deployment

71 Use cases

The pilot is carried out by NCSR-D and SWC in the frame of SC6 Europe in a changing world

- inclusive innovative and reflective societies

The pilot demonstrates the following workflow Municipality economic data (ie budget and

budget execution data) are ingested at a regular basis (daily weekly and so on) from a series

of locations in a variety of structures and formats are homogenized so that they can be

compared analyzed and visualized in a comprehensible way The data is exposed to users

via a dashboard that exposes searchdiscovery aggregation analysis correlation and

visualization functionalities over structured data The results of the data analysis will be stored

in the infrastructure to avoid carrying out the same processing multiple times

The second cycle of the pilot will extend the first pilot by incorporating different formats by

developing a modular parsing library

The following datasets are involved:

- Budget execution data of the Municipality of Athens
- Budget execution data of the Municipality of Thessaloniki
- Budget execution data of the Municipality of Barcelona

The current datasets involved are exposed either as an API or as CSV/XML files.

Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies. Statistical data will be described in the RDF DataCube12 vocabulary.
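As an indication of what the RDF DataCube description of a budget line could look like, the sketch below builds a single qb:Observation with rdflib. The dataset URI, the dimensions and the measure are illustrative assumptions, not the pilot's actual schema.

```python
# Minimal sketch: describe one budget execution figure as an RDF Data Cube
# observation. The URIs, dimensions and measures are illustrative assumptions.
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

QB = Namespace("http://purl.org/linked-data/cube#")
EX = Namespace("http://example.org/budget/")

g = Graph()
g.bind("qb", QB)
g.bind("ex", EX)

obs = EX["obs/athens-2016-6011"]
g.add((obs, RDF.type, QB.Observation))
g.add((obs, QB.dataSet, EX["dataset/athens-budget-execution"]))
g.add((obs, EX.municipality, EX["municipality/athens"]))                    # dimension
g.add((obs, EX.period, Literal("2016", datatype=XSD.gYear)))                # dimension
g.add((obs, EX.budgetCode, Literal("6011")))                                # dimension
g.add((obs, EX.executedAmount, Literal("1200.50", datatype=XSD.decimal)))   # measure

print(g.serialize(format="turtle"))
```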

The following processing is carried out:

- Data ingestion and homogenization
- Aggregation, analysis and correlation over scientific data

The following outputs are made available for visualization or further processing:

- Structured information extracted from budget datasets, exposed as a SPARQL endpoint
- Metadata for dataset searching and discovery

9 Cf. https://joinup.ec.europa.eu/asset/dcat_application_profile/description
10 Cf. http://www.omg.org/spec/EDMC-FIBO/FND/1.0/Beta1/index.htm
11 Cf. http://www.omg.org/hot-topics/finance.htm
12 Cf. https://www.w3.org/TR/2014/REC-vocab-data-cube-20140116/


- Aggregation and analysis

7.2 Requirements

Table 11 lists the ingestion, storage, processing and output requirements set by this pilot.

Requirement | Comment

R1: In case of failures during data processing, users should be able to recover data produced earlier in the workflow, along with their lineage and associated metadata | Processing modules should periodically store intermediate results. When starting up, processing modules should check the metadata registry for available intermediate results

R2: Transform budget data into a homogenized format using various parsers | Parsers will be developed for the pilot, taking into account R1

R3: Expose data and metadata through a SPARQL endpoint | The triple store should be accessed via a SPARQL endpoint

R4: Intuitive, easy-to-use interface for searching and selecting relevant data sources. The use of the user interface should be documented so that users can ease into using it with as little effort as possible | The GraphSearch UI will be used to create visualizations from SPARQL queries

Table 11: Requirements of the Second SC6 Pilot


Figure 6: Architecture of the Second SC6 Pilot

7.3 Architecture

To satisfy the requirements above, the following modules will be deployed:

Storage infrastructures:

- HDFS for storing ingested datasets
- 4store for storing homogenized statistical data and dataset metadata

Processing infrastructures:

- Metadata extraction: Spark is used to extract RDF data and metadata from budget data. These tools will react to Kafka messages.

- PoolParty: a SKOS Thesaurus13 will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g. bio terms, locations and other named entities). For this step the SWC PoolParty Semantic Suite14 will be used as an external service. PoolParty is accessible from the BDE components via an HTTP API. The connection between Spark and PoolParty has been implemented in the first pilot cycle. Additional enrichment of the dataset will be explored, e.g. via linking to DBpedia or other LOD sources.
- Data analysis that will be performed on demand by pre-defined queries in the dashboard.

13 Cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System
14 Cf. http://www.poolparty.biz

Other modules:

- Flume for dataset ingestion. For every source that will be ingested into the system there will be a Flume agent responsible for data ingestion and basic modification/unification.
- Kafka: as soon as a new record is available, a Kafka message will be produced. One Kafka consumer stores raw data into HDFS (a minimal sketch of such a consumer is given below).
- A set of pre-defined SPARQL queries that carry out analytical aggregations, important comparisons and/or other analysis of the data.
- GUI that provides functionality for (a) metadata searching to discover datasets, data and publications, and (b) linked data browsing (i.e. dereferencing entity descriptions in RDF) in the form of a visual dashboard realised in d3.js15.
- GraphSearch as the user interface.

15 Cf. https://d3js.org
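The Kafka consumer mentioned above could look like the following minimal sketch, assuming the kafka-python and hdfs (WebHDFS) client libraries; the topic name, broker address, NameNode URL and target path are illustrative assumptions.

```python
# Minimal sketch: consume raw ingested records from a Kafka topic and append
# them to a file in HDFS. Topic, broker, WebHDFS URL and target path are
# illustrative assumptions.
from kafka import KafkaConsumer
from hdfs import InsecureClient

consumer = KafkaConsumer(
    "budget-raw",                              # topic fed by the Flume agents
    bootstrap_servers="kafka:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: v.decode("utf-8"),
)

hdfs_client = InsecureClient("http://namenode:50070", user="hdfs")
target = "/bde/sc6/raw/budget_records.jsonl"

# Create the target file once so that subsequent writes can append to it.
if hdfs_client.status(target, strict=False) is None:
    hdfs_client.write(target, data="", encoding="utf-8")

for message in consumer:
    # One raw record per line; the Spark metadata extraction reacts to the
    # same Kafka messages independently of this consumer.
    hdfs_client.write(target, data=message.value + "\n",
                      encoding="utf-8", append=True)
```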

7.4 Deployment

Table 12 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.

Module | Task | Responsible
Spark over HDFS, 4store, Flume, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, SWC
Data storage schema | To be extended for the pilot | SWC
Metadata extraction | Parsers for different data sources will be developed for the pilot | SWC


GraphSearch GUI | To be configured for the pilot | SWC

Table 12: Components needed to deploy the Second SC6 Pilot

8 Second SC7 Pilot Deployment

8.1 Use cases

The pilot is carried out by SatCen, UoA and NCSR-D in the frame of SC7, Secure societies – Protecting freedom and security of Europe and its citizens.

The pilot demonstrates the following workflows:

1. Event detection workflow: news sites and social media are monitored and processed in order to extract and localize information about events. Events are categorized and the information from them is extracted; the end-user is notified about the area concerned by the news and can visualize the event information together with the changes detected by the other workflow (if activated).

2. Change detection workflow: the end user selects a relevant Area of Interest. With respect to the selected dates, two satellite images (earliest and latest) of these areas are downloaded from the ESA Sentinels Scientific Data Hub and processed in order to detect changes. The end-user is notified about detected changes and can view the images and event information about this area.

The second cycle of the SC7 pilot will extend the functionality and improve the performance of the first cycle of the pilot (cf. D5.2, Section 8).

Apart from the datasets used in the first cycle of the pilot, this cycle will also use the keyword-based Twitter API to retrieve tweets based on pre-defined keywords. To further support the keyword-based search, the second cycle of the pilot will also include a full-text indexing engine.

The following outputs are made available for visualization or further processing:

- Relevant news related to specific keywords, together with the corresponding Area of Interest
- Detected changes

Moreover, the event detection workflow will be extended in order to automatically activate the change detection workflow. These changes are depicted in the updated architecture diagram in Figure 7.


8.2 Requirements

Table 13 lists the ingestion, storage, processing and output requirements set by the second cycle of the pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 13 lists only the new requirements.

Requirement | Comment

R1: Monitor keyword-based text services (Twitter). Text is retrieved and stored together with provenance and any metadata provided by the service (notably location) | The NOMAD data connectors to Twitter and Reuters will be adapted to access the keyword search API of Twitter and store to Cassandra

R2: Regularly execute event detection using Spark over the most recent text batch | Event detection is part of the ingestion process and adds annotations to the text data; it is not part of the distributed processing

R3: Improve the speed of the change detection workflow | Optimize the scalability of the operators developed in Apache Spark for the change detection workflow

R4: Extend change detection workflow to improve accuracy | Fundamental SNAP operators (e.g. Subset and Terrain Correction) for Sentinel 1 will be adapted to Apache Spark

R5: Areas of Interest are automatically defined by event detection | The Sentinel data connector is parametrized from the event detection module with a GIS shape

R6: End-user interface is based on Sextant | Improvement of Sextant functionalities to improve the user experience


R7: Users must be authenticated and authorized to access the pilot data | Sextant will be extended in order to support authentication and authorization

Table 13: Requirements of the Second SC7 Pilot

Figure 7: Architecture of the Second SC7 Pilot

8.3 Architecture

To satisfy the requirements above, the following modules will be deployed:

Storage infrastructures:

- HDFS for storing satellite images
- Cassandra for storing news and tweets content and metadata
- Lucene for storing the GADM dataset, i.e. the administrative areas together with their geo-locations


- Strabon for storing geo-locations of detected changes and location metadata about news and tweets

Processing infrastructures:

- Spark will be made available for improving the change detection module and developing the event detection module (a minimal sketch of a Spark-based change detection step is given below)

Data integration:

- Semagrow will federate Strabon and Cassandra to provide the user interface with homogeneous access to both data stores

Other modules:

- Twitter data connector
- Reuters RSS feed reader
- The Sentinel Data Aggregator, which receives as input the set of areas of interest and submits a suitable query to the Sentinels Scientific Data Hub
- Sextant as the user interface
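As an indication of how the change detection operators can be scaled out with Spark (cf. R3), the minimal sketch below distributes (earliest, latest) image pairs per Area of Interest and applies a simple pixel-difference operator. The random arrays standing in for pre-processed Sentinel tiles and the 0.2 threshold are illustrative assumptions, not the pilot's actual SNAP-based operators.

```python
# Minimal sketch: distribute change detection over (earliest, latest) image
# pairs with Spark. Random arrays stand in for pre-processed Sentinel tiles;
# the 0.2 threshold is an illustrative assumption.
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sc7-change-detection-sketch").getOrCreate()
sc = spark.sparkContext

# One record per Area of Interest: (aoi_id, earliest image, latest image).
pairs = [
    ("aoi-001", np.random.rand(256, 256), np.random.rand(256, 256)),
    ("aoi-002", np.random.rand(256, 256), np.random.rand(256, 256)),
]

def detect_changes(record):
    aoi_id, earliest, latest = record
    diff = np.abs(latest - earliest)
    changed_fraction = float((diff > 0.2).mean())   # share of pixels flagged as changed
    return (aoi_id, changed_fraction)

results = sc.parallelize(pairs).map(detect_changes).collect()
for aoi_id, changed in results:
    print(aoi_id, changed)

spark.stop()
```

In the pilot itself, the per-pair operator corresponds to the SNAP-based operators mentioned in R4, and the detected change geometries are stored in Strabon for visualization through Sextant.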

8.4 Deployment

Table 14 lists the components provided to the pilot as part of BDI16 and components that will be developed within WP6 in the context of executing the pilot.

16 Cf. https://github.com/big-data-europe/README/wiki/Components

Module | Task | Responsible

Big Data Integrator: HDFS/Hadoop, Cassandra, Spark, Semagrow, Strabon, SOLR | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, UoA, SWC

Cassandra and Strabon stores | The schema needs to be altered to support tweets by keyword | NCSR-D and UoA

Change detection module | Spark code to be developed for extending and improving the change detection algorithm | UoA

Event Detection module | Spark code to be developed to scale the event detection algorithm | NCSR-D

Twitter data connector | To be extended to access the keyword search Twitter API | NCSR-D

User interface | To be enhanced for the pilot | UoA

Table 14: Components needed to deploy the Second SC7 Pilot


9 Conclusions

This report analysed the pilot requirements and specified the components of the generic Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting round. The relevant work in this task is to ensure that the components are within the scope of what is prepared in WP4, and that they interoperate and can be used in the same application.

All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure and provided to the piloting partners as a basis for their piloting applications, which will be developed in WP6. As a result of this preliminary testing and the interaction between the technical partners and the piloting partners, some of the original pilot descriptions have been refined and fully specified, and their usage of BDI components has been clarified. This ensures that the pilot descriptions are consistent with the first public release of the BDI platform (D4.2) and can be reproduced by interested third parties.

Work in this task (Task 5.2) will proceed as follows:

- During the second pilot deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.
- During the third pilot deployment phase, work in this task will prepare the next version of this document, regarding the BDI instances needed for the third piloting round.



Page 20: Data Technologies · consulting about the platform. This task includes two phases: the design and the deployment phase. The design phase involves the following. Review the pilot descriptions

D54 ndash v 100

Page

20

4 Second SC3 Pilot Deployment

41 Overview

The pilot is carried out by CRES in the frame of SC3 Secure Clean and Efficient Energy

The second pilot cycle extends the first pilot by adding additional online and offline data

analysis on raw data regarding Acoustic Emissions (AE) sensors and aggregated data such

as parametrics from continuous monitoring systems (CMS) The pilot demonstrates the

following workflow a developer in the field of wind energy enhances condition monitoring for

each unit in a wind farm by pooling together data from multiple units from the same farm (to

consider the cluster operation in total) and third party data (to perform correlated assessment)

The custom analysis modules created by the developer use both raw data that are transferred

offline to the processing cluster and condensed data streamed online at the same time order

that the event occurs

The following datasets are involved

Raw sensor and SCADA data from a given wind farm

Online stream data comprised of parametrics and statistics extracted from the raw

SCADA data

Raw sensor data from Acoustic Emissions module from a given wind farm

All data is in custom binary or ASCII formats ASCII files contain a metadata header and in

tabulated form the signal data (signal in columns time sequence in rows) All data is annotated

by location time and system id

The following processing is carried out

Near-real time execution of parametrized models to return operational statistics

warnings including correlation analysis of data across units

Weekly execution of operational statistics

Weekly execution of model parametrization

Weekly specific acoustic emissions DSP

The following outputs are made available for visualization or further processing

Operational statistics near-real time and weekly

Model parameters

D54 ndash v 100

Page

21

42 Requirements

Table 5 lists the ingestion storage processing and output requirements set by this pilot Since

the second cycle of the pilot extends the first pilot some requirements are identical and

therefore omitted from Table 5

Table 5 Requirements of Second SC3 Pilot

Requirement Comment

R1 The online data will be sent (via

OPC) from the intermediate

(local) processing level to BDI

A data connector must be developed that provides

for receiving OPC streams from an OPC-

compatible server

R2 The application should be able

to recover from short outages by

collecting the data transmitted

during the outage from the data

sources

An OPC data connector must be developed that

can retrieve the missing data collected at the

intermediate level from the distributed data

historian systems

R3 Near-realtime execution of

parametrized models to return

operational statistics including

correlation analysis of data

across units

The analysis software should write its results back

into a specified format and data model that is

appropriate input for further analysis

R4 The GUI supports database

querying and data visualization

for the analytics results

The GUI will be able to access files in the format

and data model

Table 5 Requirements of the Second SC3 Pilot

D54 ndash v 100

Page

22

Figure 3 Architecture of the Second SC3 Pilot

Figure 3 Architecture of the Second SC3 Pilot

43 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS that stores binary blobs each holding a temporal slice of the complete data The

slicing parameters are fixed and can be applied at data ingestion time

A Postgres relational database to store the warnings operational statistics and the

output of the analysis The schema will be defined in a later

A Kafka broker that will distribute the continuous stream of CMS to model execution

Processing infrastructures

D54 ndash v 100

Page

23

A processor that operates upon temporal slices of data

A Spark module that orchestrates the application of the processor on slices

A Spark streaming module that operates on the online data

Other modules

A data connector that offers an ingestion endpoint andor can retrieve from remote data

sources using the FTP protocol

A data connector that offers an ingestion endpoint that can retrieve an online stream

using OPC protocol and publish it to a Kafka topic

Data visualization that can visualize the data files stored in HDFS

44 Deployment

Table 6 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 6 Components needed to deploy the Second SC3 Pilot

Module Task Responsible

Spark HDFS Postgres

Kafka

BDI dockers made available by WP4 FH TF InfAI

NCSR-D SWC

Acoustic Emissions DSP To be developed for the pilot CRES

OPC Data connector To be developed for the pilot CRES

Data visualization To be extended for the pilot CRES

Table 6 Components needed to deploy the Second SC3 Pilot

D54 ndash v 100

Page

24

5 Second SC4 Pilot Deployment

51 Use cases

The pilot is carried out by FhG and CERTH in the frame of SC4 Smart Green and Integrated

Transport

The pilot demonstrates how to implement the workflow for ingesting processing and storing

stream and historical traffic data in a distributed environment The pilot demonstrates the

following workflows

The map matching of the Floating Car Data (FCD) stream that is generated by the taxi

fleet The FCD data that represents the position of cabs using latitude and longitude

coordinates must be map matched to the roads on which the cabs are driving in order

to infer the traffic conditions of the roads The map matching is done through an

algorithm using a geographical database and topological rules

The monitoring of the current traffic conditions that consumes the mapped FCD data

and infers the traffic conditions of the roads

The forecasting of future traffic conditions based on a model that is trained from

historical and real-time mapped FCD data

The second pilot is based upon the processing modules developed in the first pilot (cf D52

Section 5) namely the processing modules developed by CERTH to analyze traffic data

classify traffic conditions The second pilot will also develop the newly added workflow of the

traffic forecasting and model training that did not exist during the first pilot cycle

The data sources available for the pilot are

A near-real time stream Floating Car Data (FCD) generated by a fleet of 1200 taxis

containing information about the position speed and direction of the cabs

A historical database of recorded FCD data

A geographical database with information about the road network in Thessaloniki

The results of traffic monitoring and traffic forecasting are saved into a database for querying

statistics and visualizations

52 Requirements

Table 7 lists the ingestion storage processing and output requirements set by this pilot Since

the present pilot cycle is an extension of the first pilot the requirements of the first pilot also

apply Table 13 lists only the new requirements

D54 ndash v 100

Page

25

Table 7 Requirements of the Second SC4 Pilot

Requirement Comment

R1 The pilot will enable the

evaluation of the present and

future traffic conditions (eg

congestion) within temporal

windows

The FCD map matched data are used to determine

the current traffic condition and to make predictions

within different time windows

R2 The traffic predictions will be

saved in a database

Traffic condition and prediction will be used for

queries statistics evaluation of the quality of

predictions visualizations

R3 The pilot can be started in two

configurations single node (for

development and testing) and

cluster (production)

It must be possible to run all the pilot components

in one single node for development and testing

purposes The cluster configuration must provide

cluster of any components messaging system

(Kafka) processing modules (Flink Spark

TensorFlow) storage (Postgres)

Table 7 Requirements of the Second SC4 Pilot

D54 ndash v 100

Page

26

Figure 4 Architecture of the Second SC4 Pilot

Figure 4 Architecture of the Second SC4 Pilot

53 Architecture

The architecture of the pilot has been designed taking into consideration the data sources

mostly streams the processing steps needed and the information that needs to be computed

The pilot will ingest data from a near real-time FCD data stream from cabs and from historical

FCD data The FCD data needs to be preprocessed for map matching before being used for

classificationprediction

Apache Kafka will be used to distribute the computations as it provides a scalable fault

tolerant messaging system The processing of the data streams will be performed within

temporal windows Apache Flink will be used for the map matching algorithm in the same

manner as in the first cycle of the pilot Apache Spark or Tensorflow will be considered as a

platform to implement the traffic forecasting algorithm

The algorithms used for the map matching and classification will be provided using R as

it provides a good support for machine learning algorithms and because it is commonly used

D54 ndash v 100

Page

27

and well known by researchers at CERTH In order to use the R packages in a Flink application

developed in Java the pilot will connect to R server (via Rserve) Recurrent Neural Networks

will be used for the traffic forecasting module

The traffic conditions and prediction computation will be stored in a scalable fault tolerant

database such as Elasticsearch The storage system must support spatial and temporal

indexing

54 Deployment

Table 8 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 8 Components needed to deploy Second SC4 Pilot

Module Task Responsible

PostGIS Elasticsearch

Kafka Flink Spark

TensorFlow

BDI dockers made available by WP4 NCSR-D SWC

TF FhG

A Kafka producer for FCD

data stream (source URL)

and historical data (source

file system)

Develop a Kafka producer to collect

the FCD data as a stream from web

services and from the file system for

the historical data sets and send them

to a Kafka topic

FhG

Kafka brokers Install Kafka to provide a message

broker and the topics

SWC

A Spark application for traffic

forecasting and model

training

Develop a Spark application that

consumes FCD matched data from a

Kafka topic The application will train a

prediction model and write the traffic

predictions to ElasticSearch

FhG

D54 ndash v 100

Page

28

A Kafka consumer for storing

analysis results

Develop a Kafka consumer that stores

the result of the Traffic Classification

and prediction module

FhG

Table 8 Components needed to deploy the Second SC4 Pilot

6 Second SC5 Pilot Deployment

61 Use cases

The pilot is carried out by NCSR-D in the frame of SC5 Climate Action Environment Resource

Efficiency and Raw Materials

The pilot demonstrates the following workflow A (potentially hazardous) substance is released

in the atmosphere that results to increased readings in one or more monitoring stations The

user accesses a user interface provided by the pilot to define the locations of the monitoring

stations as well as a timeseries of the measured values (eg gamma dose rate) The platform

initiates

a weather matching algorithm that is a search for similarity of the current weather and

the pre-computed weather patterns as well as

a dispersion matching algorithm that is a search for similarity of the current substance

dispersion patterns with the precomputed ones

The weather patterns have been extracted in a pre-processing step by clustering weather

conditions recorded in the past while the substance dispersion patterns have been

precomputed by simulating different scenarios of substance release and weather conditions

The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request

The following datasets are involved

NetCDF files from the European Centre for Medium range Weather Forecasting

(ECMWF7)

GRIB files from National Oceanic and Atmospheric Administration (NOAA8)

The following processing will be carried out

The weather clustering algorithm that creates clusters of similar weather conditions

implemented using the BDI platform (see Section 63)

7 httpappsecmwfintdatasets 8 httpswwwncdcnoaagovdata-accessmodel-datamodel-datasetsglobal-forcast-system-gfs

D54 ndash v 100

Page

29

The WRF downscaling that takes as input a low resolution weather and creates a high

resolution weather

The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model

computes dispersion patterns given predominant weather conditions

The following outputs are made available for visualization or further processing

The dispersions produced by DIPCOT

The weather clusters produced by the weather clustering algorithm

62 Requirements

Table 9 lists the ingestion storage processing and output requirements set by this pilot

Table 9 Requirements of Second SC5 Pilot

Requirement Comment

R1 Provide a means of downloading

currentevaluation weather from

ECMWF or alternative services

Data connectorinterface needs to be developed

R2 ECMWF and NOAA datasets are

compatible with the WRF and

DIPCOT naming conventions

A preprocessing WPS normalization step will

perform the necessary transformations and

variable renamings needs to ensure compatibility

R3 Retrieve NetCDF files from HDFS

as input to the weather clustering

algorithm

R4 Dispersion matching will filter on

dispersion values

Relational database will provide indexes on

dispersion values for efficient dispersion search

R5 Dispersion visualization Weather and dispersion matching must produce

output compatible with Sextantrsquos input or Sextant

must be modified to support new input

Table 9 Requirements of the Second SC5 Pilot

D54 ndash v 100

Page

30

Figure 5 Architecture of the Second SC5 Pilot

Figure 5 Architecture of the Second SC5 Pilot

63 Architecture

To satisfy the requirements described above the following components will be deployed

Storage infrastructure

HDFS for storing NetCDF and GRIB files

Postgres for storing dispersions

Processing components

Scilearn-kit or TensorFlow to host the weather clustering algorithm

Other modules

ECMWF and NOAA data connectors

WPS normalization procedure

WRF downscaling component

DIPCOT atmospheric dispersion model

Weather and dispersion matching

Sextant for visualizing the dispersion layer

D54 ndash v 100

Page

31

64 Deployment

Table 10 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 10 Components needed to deploy the Second SC5 Pilot

Module Task Responsible

HDFS Sextant Postgres BDI dockers made available by WP4 TF UoA NCSR-D

Scikit-learn TensorFlow To be developed in the pilot NCSR-D

DIPCOT To be packaged in the pilot NCSR-D

Weather clustering algorithm To be developed in the pilot NCSR-D

Weather matching To be developed in the pilot NCSR-D

Dispersion matching To be developed in the pilot NCSR-D

ECMWF and NOAA data

connector

To be developed in the pilot NCSR-D

Data visualization UI To be developed in the pilot NCSR-D

Table 10 Components needed to deploy the Second SC5 Pilot

D54 ndash v 100

Page

32

7 Second SC6 Pilot Deployment

7.1 Use cases

The pilot is carried out by NCSR-D and SWC in the frame of SC6: Europe in a changing world - inclusive, innovative and reflective societies.

The pilot demonstrates the following workflow: Municipality economic data (i.e. budget and budget execution data) are ingested on a regular basis (daily, weekly, and so on) from a series of locations in a variety of structures and formats, and are homogenized so that they can be compared, analyzed and visualized in a comprehensible way. The data is exposed to users via a dashboard that offers search/discovery, aggregation, analysis, correlation and visualization functionalities over structured data. The results of the data analysis will be stored in the infrastructure to avoid carrying out the same processing multiple times.

The second cycle of the pilot will extend the first pilot by incorporating different formats, developing a modular parsing library for this purpose.

The following datasets are involved:
Budget execution data of the Municipality of Athens.
Budget execution data of the Municipality of Thessaloniki.
Budget execution data of the Municipality of Barcelona.

The datasets currently involved are exposed either through an API or as CSV/XML files.

Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies. Statistical data will be described using the RDF Data Cube12 vocabulary.

The following processing is carried out:
Data ingestion and homogenization.
Aggregation, analysis and correlation over the structured data.

The following outputs are made available for visualization or further processing:
Structured information extracted from the budget datasets, exposed as a SPARQL endpoint.
Metadata for dataset searching and discovery.

Aggregation and analysis results.

9 Cf. https://joinup.ec.europa.eu/asset/dcat_application_profile/description
10 Cf. http://www.omg.org/spec/EDMC-FIBO/FND/1.0/Beta1/index.htm
11 Cf. http://www.omg.org/hot-topics/finance.htm
12 Cf. https://www.w3.org/TR/2014/REC-vocab-data-cube-20140116/
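To make the homogenized representation concrete, the sketch below expresses a single budget-execution record as an RDF Data Cube observation using rdflib. The ex: namespace, dimension and measure properties are illustrative assumptions and not the pilot's actual schema.

```python
# Sketch: one budget execution record as an RDF Data Cube observation.
# The ex: namespace and property names are assumptions for illustration only.
from rdflib import Graph, Namespace, Literal, RDF
from rdflib.namespace import XSD

QB = Namespace("http://purl.org/linked-data/cube#")
EX = Namespace("http://example.org/budget#")

g = Graph()
g.bind("qb", QB)
g.bind("ex", EX)

obs = EX["obs/athens-2016-expenditure-001"]
g.add((obs, RDF.type, QB.Observation))
g.add((obs, QB.dataSet, EX["dataset/athens-budget-execution"]))
g.add((obs, EX.municipality, Literal("Athens")))
g.add((obs, EX.fiscalYear, Literal(2016, datatype=XSD.gYear)))
g.add((obs, EX.budgetLine, Literal("60-7331")))
g.add((obs, EX.amountExecuted, Literal("125000.00", datatype=XSD.decimal)))

print(g.serialize(format="turtle"))
```

Observations of this form would be loaded into the triple store and exposed through the SPARQL endpoint described below.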

7.2 Requirements

Table 11 lists the ingestion, storage, processing and output requirements set by this pilot.

Table 11: Requirements of the Second SC6 Pilot

R1: In case of failures during data processing, users should be able to recover data produced earlier in the workflow, along with their lineage and associated metadata.
Comment: Processing modules should periodically store intermediate results. When starting up, processing modules should check at the metadata registry if intermediate results are available (see the sketch following the table).

R2: Transform budget data into a homogenized format using various parsers.
Comment: Parsers will be developed for the pilot, taking into account R1.

R3: Expose data and metadata through a SPARQL endpoint.
Comment: The triple store should be accessed via a SPARQL endpoint.

R4: Intuitive, easy-to-use interface for searching and selecting relevant data sources. The use of the user interface should be documented so that users can ease into using it with as little effort as possible.
Comment: The GraphSearch UI will be used to create visualizations from SPARQL queries.
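One simple way a processing module could honour R1, sketched below, is to probe for previously stored intermediate results before recomputing a stage. The WebHDFS endpoint, output paths, _SUCCESS marker convention and the hdfs Python client are illustrative assumptions, not the pilot's actual mechanism.

```python
# Sketch: resume-from-intermediate-results check on startup (assumed HDFS layout).
from hdfs import InsecureClient

client = InsecureClient("http://namenode:50070", user="bde")   # assumed WebHDFS endpoint

STAGE_OUTPUT = "/pilots/sc6/intermediate/homogenized-budgets"  # assumed path

def stage_already_done(path):
    """Return True if a previous run left a completed intermediate result."""
    # status(..., strict=False) returns None instead of raising when the path is absent.
    return client.status(path + "/_SUCCESS", strict=False) is not None

if stage_already_done(STAGE_OUTPUT):
    print("Intermediate results found; skipping the homogenization stage.")
else:
    print("No intermediate results; running homogenization from scratch.")
    # ... run the parsing/homogenization stage and write results plus a _SUCCESS marker ...
```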


Figure 6: Architecture of the Second SC6 Pilot

7.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures:
HDFS for storing ingested datasets.
4store for storing homogenized statistical data and dataset metadata.

Processing infrastructures:
Metadata extraction: Spark is used to extract RDF data and metadata from the budget data. These tools will react to Kafka messages.
PoolParty: a SKOS thesaurus13 will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g. bio terms, locations and other named entities). For this step the SWC PoolParty Semantic Suite14 will be used as an external service.

13 Cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System
14 Cf. http://www.poolparty.biz


PoolParty is accessible from the BDE components via an HTTP API. The connection between Spark and PoolParty has been implemented in the first pilot cycle. Additional enrichment of the dataset will be explored, e.g. via linking to DBpedia or other LOD sources.
Data analysis, which will be performed on demand by pre-defined queries in the dashboard.

Other modules:
Flume for dataset ingestion: for every source that will be ingested into the system there will be a Flume agent responsible for data ingestion and basic modification/unification.
Kafka: as soon as a new record is available, a Kafka message will be produced. One Kafka consumer stores raw data into HDFS.
A set of pre-defined SPARQL queries that carry out analytical aggregations, important comparisons and/or other analysis of the data (an example query is sketched after this list).
A GUI that provides functionality for (a) metadata searching to discover datasets, data and publications, and (b) linked data browsing (i.e. dereferencing entity descriptions in RDF), in the form of a visual dashboard realised in d3.js15.
GraphSearch as the user interface.
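As an example of the kind of pre-defined query the dashboard might issue, the sketch below aggregates executed amounts per municipality and year over the pilot's SPARQL endpoint. The endpoint URL and the ex: properties reuse the illustrative schema sketched earlier and are assumptions, not the pilot's actual vocabulary.

```python
# Sketch: a pre-defined aggregation query against the pilot's SPARQL endpoint.
# Endpoint URL and ex: properties are illustrative assumptions.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://localhost:8080/sparql"   # assumed 4store endpoint

QUERY = """
PREFIX ex: <http://example.org/budget#>
SELECT ?municipality ?year (SUM(?amount) AS ?total)
WHERE {
  ?obs ex:municipality   ?municipality ;
       ex:fiscalYear     ?year ;
       ex:amountExecuted ?amount .
}
GROUP BY ?municipality ?year
ORDER BY ?municipality ?year
"""

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["municipality"]["value"], row["year"]["value"], row["total"]["value"])
```

Caching such query results in the infrastructure is what allows the dashboard to avoid recomputing the same analysis multiple times.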

7.4 Deployment

Table 12 lists the components provided to the pilot as part of BDI and the components that will be developed within WP6 in the context of executing the pilot.

Table 12: Components needed to deploy the Second SC6 Pilot

Module | Task | Responsible
Spark over HDFS, 4store, Flume, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, SWC
Data storage schema | To be extended for the pilot | SWC
Metadata extraction | Parsers for different data sources will be developed for the pilot | SWC
GraphSearch GUI | To be configured for the pilot | SWC

15 Cf. https://d3js.org/

8 Second SC7 Pilot Deployment

8.1 Use cases

The pilot is carried out by SatCen, UoA and NCSR-D in the frame of SC7: Secure societies - Protecting freedom and security of Europe and its citizens.

The pilot demonstrates the following workflows:

1. Event detection workflow: News sites and social media are monitored and processed in order to extract and localize information about events. Events are categorized and the relevant information is extracted; the end-user is notified about the area concerned by the news and can visualize the event information together with the changes detected by the other workflow (if activated).

2. Change detection workflow: The end user selects a relevant Area of Interest. With respect to the selected dates, two satellite images (earliest and latest) of these areas are downloaded from the ESA Sentinels Scientific Data Hub and processed in order to detect changes. The end-user is notified about detected changes and can view the images and event information about this area.

The second cycle of the SC7 pilot will extend the functionality and improve the performance of the first cycle of the pilot (cf. D5.2, Section 8).

Apart from the datasets used in the first cycle of the pilot, this cycle will also use the keyword-based Twitter API to retrieve tweets based on pre-defined keywords. To further support the keyword-based search, the second cycle of the pilot will also include a full-text indexing engine.

The following outputs are made available for visualization or further processing:
Relevant news related to specific keywords, together with the corresponding Area of Interest.
Detected changes.

Moreover, the event detection workflow will be extended in order to automatically activate the change detection workflow. These changes are depicted in the updated architecture diagram in Figure 7.


8.2 Requirements

Table 13 lists the ingestion, storage, processing and output requirements set by the second cycle of the pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 13 lists only the new requirements.

Table 13: Requirements of the Second SC7 Pilot

R1: Monitor keyword-based text services (Twitter). Text is retrieved and stored together with provenance and any metadata provided by the service (notably location).
Comment: The NOMAD data connectors to Twitter and Reuters will be adapted to access the keyword search API of Twitter and store to Cassandra (see the sketch following the table).

R2: Regularly execute event detection using Spark over the most recent text batch.
Comment: Event detection is part of the ingestion process and adds annotations to the text data, not part of the distributed processing.

R3: Improve the speed of the change detection workflow.
Comment: Optimize the scalability of the operators developed in Apache Spark for the change detection workflow.

R4: Extend the change detection workflow to improve accuracy.
Comment: Fundamental SNAP operators (e.g. Subset and Terrain Correction) for Sentinel-1 will be adapted to Apache Spark.

R5: Areas of Interest are automatically defined by event detection.
Comment: The Sentinel data connector is parametrized from the event detection module with a GIS shape.

R6: The end-user interface is based on Sextant.
Comment: Improvement of Sextant functionalities to improve the user experience.

R7: Users must be authenticated and authorized to access the pilot data.
Comment: Sextant will be extended in order to support authentication and authorization.
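To illustrate the storage side of R1, the sketch below defines one possible Cassandra table keyed by search keyword and inserts a tweet together with its provenance and location metadata, using the DataStax Python driver. The keyspace, table layout and column names are assumptions for illustration, not the schema the pilot will actually adopt.

```python
# Sketch: storing keyword-matched tweets in Cassandra (assumed keyspace/table layout).
from datetime import datetime
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])        # assumed contact point
session = cluster.connect("sc7_pilot")  # assumed existing keyspace

# One possible layout: partition by keyword, cluster by time for recent-batch queries (R2).
session.execute("""
    CREATE TABLE IF NOT EXISTS tweets_by_keyword (
        keyword text,
        created_at timestamp,
        tweet_id bigint,
        text text,
        source text,
        latitude double,
        longitude double,
        PRIMARY KEY (keyword, created_at, tweet_id)
    ) WITH CLUSTERING ORDER BY (created_at DESC, tweet_id ASC)
""")

session.execute(
    """
    INSERT INTO tweets_by_keyword
        (keyword, created_at, tweet_id, text, source, latitude, longitude)
    VALUES (%s, %s, %s, %s, %s, %s, %s)
    """,
    ("wildfire", datetime.utcnow(), 815170000000000001,
     "Smoke visible north of the city", "twitter-keyword-api", 37.98, 23.72),
)
```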

Figure 7: Architecture of the Second SC7 Pilot

8.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures:
HDFS for storing satellite images.
Cassandra for storing news and tweet content and metadata.
Lucene for storing the GADM dataset, i.e. the administrative areas together with their geo-locations.
Strabon for storing the geo-locations of detected changes and location metadata about news and tweets.

Processing infrastructures:
Spark will be made available for improving the change detection module and developing the event detection module.

Data integration:
Semagrow will federate Strabon and Cassandra to provide the user interface with homogeneous access to both data stores.

Other modules:
Twitter data connector.
Reuters RSS feed reader.
The Sentinel Data Aggregator, which receives as input the set of Areas of Interest and submits a suitable query to the Sentinels Scientific Data Hub (see the sketch after this list).
Sextant as the user interface.
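As an indication of how the Sentinel Data Aggregator might turn an Area of Interest into a query, the sketch below searches the Sentinels Scientific Data Hub OpenSearch API for Sentinel-1 products intersecting a polygon within a date range. The endpoint, credentials, product type and query fields follow the public SciHub conventions but should be treated as assumptions rather than the pilot's actual implementation.

```python
# Sketch: querying the Sentinels Scientific Data Hub for products over an AOI.
# URL, credentials and query syntax follow public SciHub conventions (assumed here).
import requests

SCIHUB = "https://scihub.copernicus.eu/dhus/search"
AOI = "POLYGON((23.5 37.8, 24.0 37.8, 24.0 38.2, 23.5 38.2, 23.5 37.8))"  # example AOI

query = (
    'platformname:Sentinel-1 '
    'AND producttype:GRD '
    'AND beginposition:[2016-12-01T00:00:00.000Z TO 2016-12-31T23:59:59.999Z] '
    f'AND footprint:"Intersects({AOI})"'
)

response = requests.get(
    SCIHUB,
    params={"q": query, "rows": 10, "format": "json"},
    auth=("scihub_user", "scihub_password"),   # placeholder credentials
    timeout=60,
)
response.raise_for_status()

for entry in response.json()["feed"].get("entry", []):
    print(entry["title"])
```

The earliest and latest products returned for the AOI would then be downloaded to HDFS and handed to the Spark-based change detection workflow.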

8.4 Deployment

Table 14 lists the components provided to the pilot as part of BDI16 and the components that will be developed within WP6 in the context of executing the pilot.

Table 14: Components needed to deploy the Second SC7 Pilot

Module | Task | Responsible
Big Data Integrator: HDFS/Hadoop, Cassandra, Spark, Semagrow, Strabon, SOLR | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, UoA, SWC
Cassandra and Strabon stores | The schema needs to be altered to support tweets by keyword | NCSR-D and UoA
Change detection module | Spark code to be developed for extending and improving the change detection algorithm | UoA
Event detection module | Spark code to be developed to scale the event detection algorithm | NCSR-D
Twitter data connector | To be extended to access the keyword search Twitter API | NCSR-D
User interface | To be enhanced for the pilot | UoA

16 Cf. https://github.com/big-data-europe/README/wiki/Components


9 Conclusions

This report analysed the pilot requirements and specified the components of the generic Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting round. The relevant work in this task is to ensure that the components are within the scope of what is prepared in WP4, and that they interoperate and can be used in the same application.

All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure and provided to the piloting partners as a basis for their piloting applications, which will be developed in WP6. As a result of this preliminary testing and of the interaction between the technical partners and the piloting partners, some of the original pilot descriptions have been refined and fully specified, and their usage of BDI components has been clarified. This ensures that the pilot descriptions are consistent with the first public release of the BDI platform (D4.2) and can be reproduced by interested third parties.

Work in this task (Task 5.2) will proceed as follows:
During the second pilot deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.
During the third pilot deployment phase, work in this task will prepare the next version of this document, regarding the BDI instances needed for the third piloting round.


D54 ndash v 100

Page

29

The WRF downscaling that takes as input a low resolution weather and creates a high

resolution weather

The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model

computes dispersion patterns given predominant weather conditions

The following outputs are made available for visualization or further processing

The dispersions produced by DIPCOT

The weather clusters produced by the weather clustering algorithm

62 Requirements

Table 9 lists the ingestion storage processing and output requirements set by this pilot

Table 9 Requirements of Second SC5 Pilot

Requirement Comment

R1 Provide a means of downloading

currentevaluation weather from

ECMWF or alternative services

Data connectorinterface needs to be developed

R2 ECMWF and NOAA datasets are

compatible with the WRF and

DIPCOT naming conventions

A preprocessing WPS normalization step will

perform the necessary transformations and

variable renamings needs to ensure compatibility

R3 Retrieve NetCDF files from HDFS

as input to the weather clustering

algorithm

R4 Dispersion matching will filter on

dispersion values

Relational database will provide indexes on

dispersion values for efficient dispersion search

R5 Dispersion visualization Weather and dispersion matching must produce

output compatible with Sextantrsquos input or Sextant

must be modified to support new input

Table 9 Requirements of the Second SC5 Pilot

D54 ndash v 100

Page

30

Figure 5 Architecture of the Second SC5 Pilot

Figure 5 Architecture of the Second SC5 Pilot

63 Architecture

To satisfy the requirements described above the following components will be deployed

Storage infrastructure

HDFS for storing NetCDF and GRIB files

Postgres for storing dispersions

Processing components

Scilearn-kit or TensorFlow to host the weather clustering algorithm

Other modules

ECMWF and NOAA data connectors

WPS normalization procedure

WRF downscaling component

DIPCOT atmospheric dispersion model

Weather and dispersion matching

Sextant for visualizing the dispersion layer

D54 ndash v 100

Page

31

64 Deployment

Table 10 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 10 Components needed to deploy the Second SC5 Pilot

Module Task Responsible

HDFS Sextant Postgres BDI dockers made available by WP4 TF UoA NCSR-D

Scikit-learn TensorFlow To be developed in the pilot NCSR-D

DIPCOT To be packaged in the pilot NCSR-D

Weather clustering algorithm To be developed in the pilot NCSR-D

Weather matching To be developed in the pilot NCSR-D

Dispersion matching To be developed in the pilot NCSR-D

ECMWF and NOAA data

connector

To be developed in the pilot NCSR-D

Data visualization UI To be developed in the pilot NCSR-D

Table 10 Components needed to deploy the Second SC5 Pilot

D54 ndash v 100

Page

32

7 Second SC6 Pilot Deployment

71 Use cases

The pilot is carried out by NCSR-D and SWC in the frame of SC6 Europe in a changing world

- inclusive innovative and reflective societies

The pilot demonstrates the following workflow Municipality economic data (ie budget and

budget execution data) are ingested at a regular basis (daily weekly and so on) from a series

of locations in a variety of structures and formats are homogenized so that they can be

compared analyzed and visualized in a comprehensible way The data is exposed to users

via a dashboard that exposes searchdiscovery aggregation analysis correlation and

visualization functionalities over structured data The results of the data analysis will be stored

in the infrastructure to avoid carrying out the same processing multiple times

The second cycle of the pilot will extend the first pilot by incorporating different formats by

developing a modular parsing library

The following datasets are involved

Budget execution data of Municipality of Athens

Budget execution data of Municipality of Thessaloniki

Budget execution data of Municipality of Barcelona

The current datasets involved are exposed either as an API or as CSV XML files

Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies

Statistical data will be described in the RDF DataCube12 vocabulary

The following processing is carried out

Data ingestion and homogenization

Aggregation analysis correlation over scientific data

The following outputs are made available for visualization or further processing

Structured information extracted from budget datasets exposed as a SPARQL endpoint

Metadata for dataset searching and discovery

9 Cf httpsjoinupeceuropaeuassetdcat_application_profiledescription 10 Cf httpwwwomgorgspecEDMC-FIBOFND10Beta1indexhtm 11 Cf httpwwwomgorghot-topicsfinancehtm 12 Cf httpswwww3orgTR2014REC-vocab-data-cube-20140116

D54 ndash v 100

Page

33

Aggregation and analysis

72 Requirements

Table 11 lists the ingestion storage processing and output requirements set by this pilot

Table 11 Requirements of the Second SC6 Pilot

Requirement Comment

R1 In case of failures during data

processing users should be able to

recover data produced earlier in the

workflow along with their lineage and

associated metadata

Processing modules should periodically

store intermediate results When starting

up processing modules should check at the

metadata registry if intermediate results are

available

R2 Transform budget data into a

homogenized format using various

parsers

Parsers will be developed for the pilot

taking into account R1

R3 Expose data and metadata through a

SPARQL endpoint

The triple store should be accessed via a

SPARQL endpoint

R4 Intuitive easy-to-use interface for

searching and selecting relevant data

sources The use of the user interface

should be documented so that users

can ease into using it with as little

effort as possible

The GraphSearch UI will be used to create

visualizations from SPARQL queries

Table 11 Requirements of the Second SC6 Pilot

D54 ndash v 100

Page

34

Figure 6 Architecture of the Second SC6 Pilot

Figure 6 Architecture of the Second SC6 Pilot

73 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing ingested datasets

4store for storing homogenized statistical data and dataset metadata

Processing infrastructures

Metadata extraction Spark is used to extract RDF data and metadata from budget

data These tools will react on Kafka messages

PoolParty A SKOS Thesaurus13 will be used to consolidatetranslate (linkmap) the

terms in the ingested documents (eg bio terms locations and other named entities)

For this step the SWC PoolParty Semantic Suite14 will be used as an external service

13 Please cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 14 Please cf httpwwwpoolpartybiz

D54 ndash v 100

Page

35

PoolParty is accessible from the BDE components via an HTTP API The connection

between Spark and PoolParty has been implemented in the first pilot cycle Additional

enrichment of the dataset will be explored eg via linking to DBpedia or other LOD

sources

Data analysis that will be performed on demand by pre-defined queries in the

dashboard

Other modules

Flume for dataset ingestion For every source that will be ingested into the system there

will be a flume agent responsible for data ingestion and basic modificationunification

Kafka as soon as a new record is available a Kafka message will be produced One

kafka consumer stores raw data into HDFS

A set of pre-defined SPARQL queries that carry out analytical aggregations important

comparisons and or other analysis of the data

GUI that provide functionality for (a) metadata searching to discover datasets data and

publications (b) linked data browsing (ie dereferencing entity descriptions in RDF) in

the form of a visual dashboard realised in d3js15

GraphSearch as the user interface

74 Deployment

Table 12 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 12 Components needed to deploy the Second SC6 Pilot

Module Task Responsible

Spark over HDFS 4store

Flume Kafka

BDI dockers made available by WP4 FH TF InfAI

NCSR-D SWC

Data storage schema To be extended for the pilot SWC

Metadata extraction Parsers for different data sources will

be developed for the pilot

SWC

15 Cf httpsd3jsorg

D54 ndash v 100

Page

36

GraphSearch GUI To be configured for the pilot SWC

Table 12 Components needed to deploy the Second SC6 Pilot

8 Second SC7 Pilot Deployment

81 Use cases

The pilot is carried out by SatCen UoA and NCSR-D in the frame of SC7 Secure societies ndash

Protecting freedom and security of Europe and its citizens

The pilot demonstrates the following workflows

1 Event detection workflow News sites and social media are monitored and processed

in order to extract and localize information about events Events are categorized and

the information from them is extracted the end-user is notified about the area interested

by the news and can visualize the events information together with the changes

detected by the other workflow (if activated)

2 Change detection workflow The end user selects a relevant Area of Interest With

respect to the selected dates two satellite images (earliest and latest) of these areas

are downloaded from ESA Sentinels Scientific Data Hub and processed in order to

detect changes The end-user is notified about detected changes and can view the

images and event information about this area

The second cycle of the SC7 pilot will extend the functionality and improve the performance of

the first cycle of the pilot (cf D52 Section 8)

Apart from the datasets used in the first cycle of the pilot this cycle will also use the keyword-

based Twitter API to retrieve tweets based on pre-defined keywords To further support the

keyword-based search the second cycle of the pilot will also include a full-text indexing engine

The following outputs are made available for visualization or further processing

Relevant news related to specific keywords together with the corresponding Area of

Interested

Detected changes

Moreover the event detection workflow will be extended in order to automatically activate the

change detection workflow These changes are depicted in the updated architecture diagram

in Figure 7

D54 ndash v 100

Page

37

82 Requirements

Table 13 lists the ingestion storage processing and output requirements set by the second

cycle of the pilot Since the present pilot cycle is an extension of the first pilot the requirements

of the first pilot also apply Table 13 lists only the new requirements

Table 13 Requirements of the Second SC7 Pilot

Requirement Comment

R1 Monitor keyword-based text services

(Twitter) Text is retrieved and stored

together with provenance and any

metadata provided by the service

(notably location)

The NOMAD data connectors to Twitter

and Reuters will be adapted to access the

keyword search API of Twitter and store to

Cassandra

R2 Regularly execute event detection

using Spark over the most recent text

batch

Event detection is part of the ingestion

process and adds annotations to the text

data not part of the distributed processing

R3 Improve the speed of the change

detection workflow

Optimize the scalability of the operators

developed in Apache Spark for the change

detection workflow

R4 Extend change detection workflow to

improve accuracy

Fundamental SNAP operators (eg Subset

and Terrain Correction) for Sentinel 1 will be

adapted to Apache Spark

R5 Areas of Interest are automatically

defined by event detection

The Sentinel data connector is

parametrized from the event detection

module with a GIS shape

R6 End-user interface is based on Sextant Improvement of Sextant functionalities to

improve the user experience

D54 ndash v 100

Page

38

R7 Users must be authenticated and

authorized to access the pilot data

Sextant will be extended in order to support

authentication and authorization

Table 13 Requirements of the Second SC7 Pilot

Figure 7 Architecture of the Second SC7 Pilot

Figure 7 Architecture of the Second SC7 Pilot

83 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing satellite images

Cassandra for storing news and tweets content and metadata

Lucene for storing GADM dataset ie the administrative areas together with their geo-

locations

D54 ndash v 100

Page

39

Strabon for storing geo-locations of detected changes and location metadata about

news and tweets

Processing infrastructures

Spark will be made available for improving the change detection module and

developing the event detection module

Data integration

Semagrow will federate Strabon and Cassandra to provide the user interface with

homogeneous access to both data stores

Other modules

Twitter data connector

Reuters RSS feed reader

The Sentinel Data Aggregator receives as input the set of areas of interest and submits

a suitable query to the Sentinels Scientific Data Hub

Sextant as the user interface

84 Deployment

Table 14 lists the components provided to the pilot as part of BDI16 and components that will

be developed within WP6 in the context of executing the pilot

Table 14 Components needed to deploy the Second SC7 Pilot

Module Task Responsible

Big Data Integrator

HDFSHadoop Cassandra

Spark Semagrow Strabon

SOLR

BDI dockers made available by WP4 FH TF InfAI

NCSR-D UoA

SwC

Cassandra and Strabon

stores

The schema needs to be altered to

support tweets by keyword

NCSR-D and

UoA

Change detection module Spark code to be developed for UoA

16 Cf httpsgithubcombig-data-europeREADMEwikiComponents

D54 ndash v 100

Page

40

extending and improving the change

detection algorithm

Event Detection module Spark code to be developed to scale

the event detection algorithm

NCSR-D

Twitter data connector To be extended to access the keyword

search Twitter API

NCSR-D

User interface To be enhanced for the pilot UoA

Table 14 Components needed to deploy the Second SC7 Pilot

D54 ndash v 100

Page

41

9 Conclusions This report analysed the pilot requirements and specifies the components of the the generic

Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting

round The relevant work in this task is to ensure that the components are within the scope

of what is prepared in WP4 and that they interoperate and can be used in the same

application

All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure

and provided to the piloting partners as a basis for their piloting applications which will be

developed in WP6 As a result of this preliminary testing and the interaction between the

technical partners and the piloting partners some of the original pilot descriptions have

been refined and fully specified and their usage of BDI components has been clarified This

ensures that the pilot descriptions are consistent with the first public release of the BDI

platform (D42) and can be reproduced by interested third parties

Work in this task (Task 52) will proceed as follows

During the second pilot deployment phase work in this task will follow and document

development of the individual components and test their integration into the platform

During the third pilot deployment phase work in this task will prepare the next version

of this document regarding the BDI instances needed for the third piloting round

Page 22: Data Technologies · consulting about the platform. This task includes two phases: the design and the deployment phase. The design phase involves the following. Review the pilot descriptions

D54 ndash v 100

Page

22

Figure 3 Architecture of the Second SC3 Pilot

Figure 3 Architecture of the Second SC3 Pilot

43 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS that stores binary blobs each holding a temporal slice of the complete data The

slicing parameters are fixed and can be applied at data ingestion time

A Postgres relational database to store the warnings operational statistics and the

output of the analysis The schema will be defined in a later

A Kafka broker that will distribute the continuous stream of CMS to model execution

Processing infrastructures

D54 ndash v 100

Page

23

A processor that operates upon temporal slices of data

A Spark module that orchestrates the application of the processor on slices

A Spark streaming module that operates on the online data

Other modules

A data connector that offers an ingestion endpoint andor can retrieve from remote data

sources using the FTP protocol

A data connector that offers an ingestion endpoint that can retrieve an online stream

using OPC protocol and publish it to a Kafka topic

Data visualization that can visualize the data files stored in HDFS

44 Deployment

Table 6 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 6 Components needed to deploy the Second SC3 Pilot

Module Task Responsible

Spark HDFS Postgres

Kafka

BDI dockers made available by WP4 FH TF InfAI

NCSR-D SWC

Acoustic Emissions DSP To be developed for the pilot CRES

OPC Data connector To be developed for the pilot CRES

Data visualization To be extended for the pilot CRES

Table 6 Components needed to deploy the Second SC3 Pilot

D54 ndash v 100

Page

24

5 Second SC4 Pilot Deployment

51 Use cases

The pilot is carried out by FhG and CERTH in the frame of SC4 Smart Green and Integrated

Transport

The pilot demonstrates how to implement the workflow for ingesting processing and storing

stream and historical traffic data in a distributed environment The pilot demonstrates the

following workflows

The map matching of the Floating Car Data (FCD) stream that is generated by the taxi

fleet The FCD data that represents the position of cabs using latitude and longitude

coordinates must be map matched to the roads on which the cabs are driving in order

to infer the traffic conditions of the roads The map matching is done through an

algorithm using a geographical database and topological rules

The monitoring of the current traffic conditions that consumes the mapped FCD data

and infers the traffic conditions of the roads

The forecasting of future traffic conditions based on a model that is trained from

historical and real-time mapped FCD data

The second pilot is based upon the processing modules developed in the first pilot (cf D52

Section 5) namely the processing modules developed by CERTH to analyze traffic data

classify traffic conditions The second pilot will also develop the newly added workflow of the

traffic forecasting and model training that did not exist during the first pilot cycle

The data sources available for the pilot are

A near-real time stream Floating Car Data (FCD) generated by a fleet of 1200 taxis

containing information about the position speed and direction of the cabs

A historical database of recorded FCD data

A geographical database with information about the road network in Thessaloniki

The results of traffic monitoring and traffic forecasting are saved into a database for querying

statistics and visualizations

52 Requirements

Table 7 lists the ingestion storage processing and output requirements set by this pilot Since

the present pilot cycle is an extension of the first pilot the requirements of the first pilot also

apply Table 13 lists only the new requirements

D54 ndash v 100

Page

25

Table 7 Requirements of the Second SC4 Pilot

Requirement Comment

R1 The pilot will enable the

evaluation of the present and

future traffic conditions (eg

congestion) within temporal

windows

The FCD map matched data are used to determine

the current traffic condition and to make predictions

within different time windows

R2 The traffic predictions will be

saved in a database

Traffic condition and prediction will be used for

queries statistics evaluation of the quality of

predictions visualizations

R3 The pilot can be started in two

configurations single node (for

development and testing) and

cluster (production)

It must be possible to run all the pilot components

in one single node for development and testing

purposes The cluster configuration must provide

cluster of any components messaging system

(Kafka) processing modules (Flink Spark

TensorFlow) storage (Postgres)

Table 7 Requirements of the Second SC4 Pilot

D54 ndash v 100

Page

26

Figure 4 Architecture of the Second SC4 Pilot

Figure 4 Architecture of the Second SC4 Pilot

53 Architecture

The architecture of the pilot has been designed taking into consideration the data sources

mostly streams the processing steps needed and the information that needs to be computed

The pilot will ingest data from a near real-time FCD data stream from cabs and from historical

FCD data The FCD data needs to be preprocessed for map matching before being used for

classificationprediction

Apache Kafka will be used to distribute the computations as it provides a scalable fault

tolerant messaging system The processing of the data streams will be performed within

temporal windows Apache Flink will be used for the map matching algorithm in the same

manner as in the first cycle of the pilot Apache Spark or Tensorflow will be considered as a

platform to implement the traffic forecasting algorithm

The algorithms used for the map matching and classification will be provided using R as

it provides a good support for machine learning algorithms and because it is commonly used

D54 ndash v 100

Page

27

and well known by researchers at CERTH In order to use the R packages in a Flink application

developed in Java the pilot will connect to R server (via Rserve) Recurrent Neural Networks

will be used for the traffic forecasting module

The traffic conditions and prediction computation will be stored in a scalable fault tolerant

database such as Elasticsearch The storage system must support spatial and temporal

indexing

54 Deployment

Table 8 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 8 Components needed to deploy Second SC4 Pilot

Module Task Responsible

PostGIS Elasticsearch

Kafka Flink Spark

TensorFlow

BDI dockers made available by WP4 NCSR-D SWC

TF FhG

A Kafka producer for FCD

data stream (source URL)

and historical data (source

file system)

Develop a Kafka producer to collect

the FCD data as a stream from web

services and from the file system for

the historical data sets and send them

to a Kafka topic

FhG

Kafka brokers Install Kafka to provide a message

broker and the topics

SWC

A Spark application for traffic

forecasting and model

training

Develop a Spark application that

consumes FCD matched data from a

Kafka topic The application will train a

prediction model and write the traffic

predictions to ElasticSearch

FhG

D54 ndash v 100

Page

28

A Kafka consumer for storing

analysis results

Develop a Kafka consumer that stores

the result of the Traffic Classification

and prediction module

FhG

Table 8 Components needed to deploy the Second SC4 Pilot

6 Second SC5 Pilot Deployment

61 Use cases

The pilot is carried out by NCSR-D in the frame of SC5 Climate Action Environment Resource

Efficiency and Raw Materials

The pilot demonstrates the following workflow A (potentially hazardous) substance is released

in the atmosphere that results to increased readings in one or more monitoring stations The

user accesses a user interface provided by the pilot to define the locations of the monitoring

stations as well as a timeseries of the measured values (eg gamma dose rate) The platform

initiates

a weather matching algorithm that is a search for similarity of the current weather and

the pre-computed weather patterns as well as

a dispersion matching algorithm that is a search for similarity of the current substance

dispersion patterns with the precomputed ones

The weather patterns have been extracted in a pre-processing step by clustering weather

conditions recorded in the past while the substance dispersion patterns have been

precomputed by simulating different scenarios of substance release and weather conditions

The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request

The following datasets are involved

NetCDF files from the European Centre for Medium range Weather Forecasting

(ECMWF7)

GRIB files from National Oceanic and Atmospheric Administration (NOAA8)

The following processing will be carried out

The weather clustering algorithm that creates clusters of similar weather conditions

implemented using the BDI platform (see Section 63)

7 httpappsecmwfintdatasets 8 httpswwwncdcnoaagovdata-accessmodel-datamodel-datasetsglobal-forcast-system-gfs

D54 ndash v 100

Page

29

The WRF downscaling that takes as input a low resolution weather and creates a high

resolution weather

The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model

computes dispersion patterns given predominant weather conditions

The following outputs are made available for visualization or further processing

The dispersions produced by DIPCOT

The weather clusters produced by the weather clustering algorithm

62 Requirements

Table 9 lists the ingestion storage processing and output requirements set by this pilot

Table 9 Requirements of Second SC5 Pilot

Requirement Comment

R1 Provide a means of downloading

currentevaluation weather from

ECMWF or alternative services

Data connectorinterface needs to be developed

R2 ECMWF and NOAA datasets are

compatible with the WRF and

DIPCOT naming conventions

A preprocessing WPS normalization step will

perform the necessary transformations and

variable renamings needs to ensure compatibility

R3 Retrieve NetCDF files from HDFS

as input to the weather clustering

algorithm

R4 Dispersion matching will filter on

dispersion values

Relational database will provide indexes on

dispersion values for efficient dispersion search

R5 Dispersion visualization Weather and dispersion matching must produce

output compatible with Sextantrsquos input or Sextant

must be modified to support new input

Table 9 Requirements of the Second SC5 Pilot

D54 ndash v 100

Page

30

Figure 5 Architecture of the Second SC5 Pilot

Figure 5 Architecture of the Second SC5 Pilot

63 Architecture

To satisfy the requirements described above the following components will be deployed

Storage infrastructure

HDFS for storing NetCDF and GRIB files

Postgres for storing dispersions

Processing components

Scilearn-kit or TensorFlow to host the weather clustering algorithm

Other modules

ECMWF and NOAA data connectors

WPS normalization procedure

WRF downscaling component

DIPCOT atmospheric dispersion model

Weather and dispersion matching

Sextant for visualizing the dispersion layer

D54 ndash v 100

Page

31

64 Deployment

Table 10 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 10 Components needed to deploy the Second SC5 Pilot

Module Task Responsible

HDFS Sextant Postgres BDI dockers made available by WP4 TF UoA NCSR-D

Scikit-learn TensorFlow To be developed in the pilot NCSR-D

DIPCOT To be packaged in the pilot NCSR-D

Weather clustering algorithm To be developed in the pilot NCSR-D

Weather matching To be developed in the pilot NCSR-D

Dispersion matching To be developed in the pilot NCSR-D

ECMWF and NOAA data

connector

To be developed in the pilot NCSR-D

Data visualization UI To be developed in the pilot NCSR-D

Table 10 Components needed to deploy the Second SC5 Pilot

D54 ndash v 100

Page

32

7 Second SC6 Pilot Deployment

71 Use cases

The pilot is carried out by NCSR-D and SWC in the frame of SC6 Europe in a changing world

- inclusive innovative and reflective societies

The pilot demonstrates the following workflow Municipality economic data (ie budget and

budget execution data) are ingested at a regular basis (daily weekly and so on) from a series

of locations in a variety of structures and formats are homogenized so that they can be

compared analyzed and visualized in a comprehensible way The data is exposed to users

via a dashboard that exposes searchdiscovery aggregation analysis correlation and

visualization functionalities over structured data The results of the data analysis will be stored

in the infrastructure to avoid carrying out the same processing multiple times

The second cycle of the pilot will extend the first pilot by incorporating different formats by

developing a modular parsing library

The following datasets are involved

Budget execution data of Municipality of Athens

Budget execution data of Municipality of Thessaloniki

Budget execution data of Municipality of Barcelona

The current datasets involved are exposed either as an API or as CSV XML files

Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies

Statistical data will be described in the RDF DataCube12 vocabulary

The following processing is carried out

Data ingestion and homogenization

Aggregation analysis correlation over scientific data

The following outputs are made available for visualization or further processing

Structured information extracted from budget datasets exposed as a SPARQL endpoint

Metadata for dataset searching and discovery

9 Cf httpsjoinupeceuropaeuassetdcat_application_profiledescription 10 Cf httpwwwomgorgspecEDMC-FIBOFND10Beta1indexhtm 11 Cf httpwwwomgorghot-topicsfinancehtm 12 Cf httpswwww3orgTR2014REC-vocab-data-cube-20140116

D54 ndash v 100

Page

33

Aggregation and analysis

72 Requirements

Table 11 lists the ingestion storage processing and output requirements set by this pilot

Table 11 Requirements of the Second SC6 Pilot

Requirement Comment

R1 In case of failures during data

processing users should be able to

recover data produced earlier in the

workflow along with their lineage and

associated metadata

Processing modules should periodically

store intermediate results When starting

up processing modules should check at the

metadata registry if intermediate results are

available

R2 Transform budget data into a

homogenized format using various

parsers

Parsers will be developed for the pilot

taking into account R1

R3 Expose data and metadata through a

SPARQL endpoint

The triple store should be accessed via a

SPARQL endpoint

R4 Intuitive easy-to-use interface for

searching and selecting relevant data

sources The use of the user interface

should be documented so that users

can ease into using it with as little

effort as possible

The GraphSearch UI will be used to create

visualizations from SPARQL queries

Table 11 Requirements of the Second SC6 Pilot

D54 ndash v 100

Page

34

Figure 6 Architecture of the Second SC6 Pilot

Figure 6 Architecture of the Second SC6 Pilot

73 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing ingested datasets

4store for storing homogenized statistical data and dataset metadata

Processing infrastructures

Metadata extraction Spark is used to extract RDF data and metadata from budget

data These tools will react on Kafka messages

PoolParty A SKOS Thesaurus13 will be used to consolidatetranslate (linkmap) the

terms in the ingested documents (eg bio terms locations and other named entities)

For this step the SWC PoolParty Semantic Suite14 will be used as an external service

13 Please cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 14 Please cf httpwwwpoolpartybiz

D54 ndash v 100

Page

35

PoolParty is accessible from the BDE components via an HTTP API The connection

between Spark and PoolParty has been implemented in the first pilot cycle Additional

enrichment of the dataset will be explored eg via linking to DBpedia or other LOD

sources

Data analysis that will be performed on demand by pre-defined queries in the

dashboard

Other modules

Flume for dataset ingestion For every source that will be ingested into the system there

will be a flume agent responsible for data ingestion and basic modificationunification

Kafka as soon as a new record is available a Kafka message will be produced One

kafka consumer stores raw data into HDFS

A set of pre-defined SPARQL queries that carry out analytical aggregations important

comparisons and or other analysis of the data

GUI that provide functionality for (a) metadata searching to discover datasets data and

publications (b) linked data browsing (ie dereferencing entity descriptions in RDF) in

the form of a visual dashboard realised in d3js15

GraphSearch as the user interface

74 Deployment

Table 12 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 12 Components needed to deploy the Second SC6 Pilot

Module Task Responsible

Spark over HDFS 4store

Flume Kafka

BDI dockers made available by WP4 FH TF InfAI

NCSR-D SWC

Data storage schema To be extended for the pilot SWC

Metadata extraction Parsers for different data sources will

be developed for the pilot

SWC

15 Cf httpsd3jsorg

D54 ndash v 100

Page

36

GraphSearch GUI To be configured for the pilot SWC

Table 12 Components needed to deploy the Second SC6 Pilot

8 Second SC7 Pilot Deployment

8.1 Use cases

The pilot is carried out by SatCen, UoA and NCSR-D in the frame of SC7, Secure societies – Protecting freedom and security of Europe and its citizens.

The pilot demonstrates the following workflows:

1. Event detection workflow: news sites and social media are monitored and processed in order to extract and localize information about events. Events are categorized and the information in them is extracted; the end-user is notified about the area concerned by the news and can visualize the event information together with the changes detected by the other workflow (if activated).

2. Change detection workflow: the end user selects a relevant Area of Interest. For the selected dates, two satellite images (earliest and latest) of these areas are downloaded from the ESA Sentinels Scientific Data Hub and processed in order to detect changes. The end-user is notified about detected changes and can view the images and event information about this area.

The second cycle of the SC7 pilot will extend the functionality and improve the performance of the first cycle of the pilot (cf. D5.2, Section 8).

Apart from the datasets used in the first cycle of the pilot, this cycle will also use the keyword-based Twitter API to retrieve tweets based on pre-defined keywords. To further support the keyword-based search, the second cycle of the pilot will also include a full-text indexing engine.

The following outputs are made available for visualization or further processing:
- Relevant news related to specific keywords, together with the corresponding Area of Interest.
- Detected changes.

Moreover, the event detection workflow will be extended in order to automatically activate the change detection workflow. These changes are depicted in the updated architecture diagram in Figure 7.
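Below is a minimal sketch of the keyword-based tweet ingestion described above, assuming the standard Twitter search API (v1.1) with bearer-token authentication and the DataStax Python driver for Cassandra; the keyspace, table and keyword list are illustrative assumptions, and the adapted NOMAD connectors mentioned in Section 8.2 may implement this differently.

```python
# Hedged sketch: retrieve tweets for pre-defined keywords and store text together
# with provenance and location metadata in Cassandra. Names are illustrative.
import requests
from cassandra.cluster import Cluster

BEARER_TOKEN = "..."                          # Twitter application credential
KEYWORDS = ["flood", "wildfire", "earthquake"]

session = Cluster(["cassandra"]).connect("sc7")

for keyword in KEYWORDS:
    resp = requests.get(
        "https://api.twitter.com/1.1/search/tweets.json",
        params={"q": keyword, "result_type": "recent", "count": 100},
        headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
    )
    for tweet in resp.json().get("statuses", []):
        session.execute(
            "INSERT INTO tweets (id, keyword, created_at, text, user, coordinates) "
            "VALUES (%s, %s, %s, %s, %s, %s)",
            (tweet["id"], keyword, tweet["created_at"], tweet["text"],
             tweet["user"]["screen_name"], str(tweet.get("coordinates"))),
        )
```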


8.2 Requirements

Table 13 lists the ingestion, storage, processing and output requirements set by the second cycle of the pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 13 lists only the new requirements.

Table 13: Requirements of the Second SC7 Pilot

Requirement | Comment
R1: Monitor keyword-based text services (Twitter). Text is retrieved and stored together with provenance and any metadata provided by the service (notably location). | The NOMAD data connectors to Twitter and Reuters will be adapted to access the keyword search API of Twitter and store to Cassandra.
R2: Regularly execute event detection using Spark over the most recent text batch. | Event detection is part of the ingestion process and adds annotations to the text data; it is not part of the distributed processing.
R3: Improve the speed of the change detection workflow. | Optimize the scalability of the operators developed in Apache Spark for the change detection workflow.
R4: Extend the change detection workflow to improve accuracy. | Fundamental SNAP operators (e.g. Subset and Terrain Correction) for Sentinel-1 will be adapted to Apache Spark.
R5: Areas of Interest are automatically defined by event detection. | The Sentinel data connector is parametrized from the event detection module with a GIS shape.
R6: The end-user interface is based on Sextant. | Improvement of Sextant functionalities to improve the user experience.
R7: Users must be authenticated and authorized to access the pilot data. | Sextant will be extended in order to support authentication and authorization.

Table 13: Requirements of the Second SC7 Pilot
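As an illustration of R2, the following sketch shows how a periodically scheduled Spark job could read the most recent text batch from Cassandra and annotate it, assuming the Spark-Cassandra connector is available on the classpath; the keyspace, table and column names as well as the trivial keyword rule are illustrative assumptions, not the pilot's actual event detection algorithm.

```python
# Hedged sketch of R2: batch event detection over the most recent tweets stored
# in Cassandra, using PySpark and the Spark-Cassandra connector. Names are illustrative.
import datetime
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("sc7-event-detection")
         .config("spark.cassandra.connection.host", "cassandra")
         .getOrCreate())

tweets = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace="sc7", table="tweets")
          .load())

# Keep only the most recent batch (here: the last hour) and flag candidate events
# with a placeholder rule standing in for the real event detection model.
cutoff = datetime.datetime.utcnow() - datetime.timedelta(hours=1)
recent = tweets.filter(F.col("created_at") > F.lit(cutoff.strftime("%Y-%m-%d %H:%M:%S")))
events = recent.withColumn("is_event", F.col("text").rlike("(?i)explosion|flood|fire"))

(events.write
 .format("org.apache.spark.sql.cassandra")
 .options(keyspace="sc7", table="tweet_annotations")
 .mode("append")
 .save())
```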

Figure 7: Architecture of the Second SC7 Pilot

8.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures:
- HDFS for storing satellite images.
- Cassandra for storing news and tweets (content and metadata).
- Lucene for storing the GADM dataset, i.e. the administrative areas together with their geo-locations.
- Strabon for storing geo-locations of detected changes and location metadata about news and tweets.

Processing infrastructures:
- Spark will be made available for improving the change detection module and developing the event detection module.

Data integration:
- Semagrow will federate Strabon and Cassandra to provide the user interface with homogeneous access to both data stores.

Other modules:
- Twitter data connector.
- Reuters RSS feed reader.
- The Sentinel Data Aggregator, which receives as input the set of areas of interest and submits a suitable query to the Sentinels Scientific Data Hub (see the sketch after this list).
- Sextant as the user interface.
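As referenced above, the following is a hedged sketch of the kind of query the Sentinel Data Aggregator could submit to the Sentinels Scientific Data Hub, using the hub's public OpenSearch interface with basic authentication; the credentials, area-of-interest polygon and date range are illustrative assumptions.

```python
# Hedged sketch: query the ESA Sentinels Scientific Data Hub (SciHub) OpenSearch
# API for Sentinel-1 products intersecting an area of interest. Names are illustrative.
import requests

SCIHUB = "https://scihub.copernicus.eu/dhus/search"
AUTH = ("username", "password")              # SciHub account credentials

aoi_wkt = "POLYGON((23.6 37.9, 23.9 37.9, 23.9 38.1, 23.6 38.1, 23.6 37.9))"
query = (
    'platformname:Sentinel-1 AND '
    f'footprint:"Intersects({aoi_wkt})" AND '
    'beginposition:[2017-01-01T00:00:00.000Z TO 2017-01-31T23:59:59.999Z]'
)

resp = requests.get(SCIHUB, params={"q": query, "rows": 10, "format": "json"},
                    auth=AUTH)
for entry in resp.json()["feed"].get("entry", []):
    print(entry["title"], entry["id"])       # product identifiers to download later
```

The earliest and latest products returned for the selected dates are the two images the change detection workflow compares.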

8.4 Deployment

Table 14 lists the components provided to the pilot as part of BDI [16] and components that will be developed within WP6 in the context of executing the pilot.

Table 14: Components needed to deploy the Second SC7 Pilot

Module | Task | Responsible
Big Data Integrator (HDFS/Hadoop, Cassandra, Spark, Semagrow, Strabon, SOLR) | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, UoA, SWC
Cassandra and Strabon stores | The schema needs to be altered to support tweets by keyword | NCSR-D and UoA
Change detection module | Spark code to be developed for extending and improving the change detection algorithm | UoA
Event detection module | Spark code to be developed to scale the event detection algorithm | NCSR-D
Twitter data connector | To be extended to access the keyword search Twitter API | NCSR-D
User interface | To be enhanced for the pilot | UoA

Table 14: Components needed to deploy the Second SC7 Pilot

[16] Cf. https://github.com/big-data-europe/README/wiki/Components
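Since the BDI instance above includes SOLR/Lucene for the GADM administrative areas, the following sketch illustrates how a detected event could be geocoded against that index and turned into an area of interest, assuming a SOLR core named gadm with name and geometry fields; the core name, field names and host are illustrative assumptions.

```python
# Hedged sketch: look up GADM administrative areas in the pilot's SOLR/Lucene
# index to localize a detected event. Core and field names are illustrative.
import pysolr

solr = pysolr.Solr("http://solr:8983/solr/gadm", timeout=10)

def localize(place_name):
    """Return candidate administrative areas whose name matches the event location."""
    results = solr.search(f'name:"{place_name}"', rows=5)
    return [(doc["name"], doc.get("wkt_geometry")) for doc in results]

# The matched geometries can then be passed to the Sentinel Data Aggregator
# as areas of interest (cf. requirement R5).
print(localize("Athens"))
```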


9 Conclusions

This report analysed the pilot requirements and specified the components of the generic Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting round. The relevant work in this task is to ensure that the components are within the scope of what is prepared in WP4, and that they interoperate and can be used in the same application.

All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure and provided to the piloting partners as a basis for their piloting applications, which will be developed in WP6. As a result of this preliminary testing and the interaction between the technical partners and the piloting partners, some of the original pilot descriptions have been refined and fully specified, and their usage of BDI components has been clarified. This ensures that the pilot descriptions are consistent with the first public release of the BDI platform (D4.2) and can be reproduced by interested third parties.

Work in this task (Task 5.2) will proceed as follows:
- During the second pilot deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.
- During the third pilot deployment phase, work in this task will prepare the next version of this document regarding the BDI instances needed for the third piloting round.


implemented using the BDI platform (see Section 63)

7 httpappsecmwfintdatasets 8 httpswwwncdcnoaagovdata-accessmodel-datamodel-datasetsglobal-forcast-system-gfs

D54 ndash v 100

Page

29

The WRF downscaling that takes as input a low resolution weather and creates a high

resolution weather

The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model

computes dispersion patterns given predominant weather conditions

The following outputs are made available for visualization or further processing

The dispersions produced by DIPCOT

The weather clusters produced by the weather clustering algorithm

62 Requirements

Table 9 lists the ingestion storage processing and output requirements set by this pilot

Table 9 Requirements of Second SC5 Pilot

Requirement Comment

R1 Provide a means of downloading

currentevaluation weather from

ECMWF or alternative services

Data connectorinterface needs to be developed

R2 ECMWF and NOAA datasets are

compatible with the WRF and

DIPCOT naming conventions

A preprocessing WPS normalization step will

perform the necessary transformations and

variable renamings needs to ensure compatibility

R3 Retrieve NetCDF files from HDFS

as input to the weather clustering

algorithm

R4 Dispersion matching will filter on

dispersion values

Relational database will provide indexes on

dispersion values for efficient dispersion search

R5 Dispersion visualization Weather and dispersion matching must produce

output compatible with Sextantrsquos input or Sextant

must be modified to support new input

Table 9 Requirements of the Second SC5 Pilot

D54 ndash v 100

Page

30

Figure 5 Architecture of the Second SC5 Pilot

Figure 5 Architecture of the Second SC5 Pilot

63 Architecture

To satisfy the requirements described above the following components will be deployed

Storage infrastructure

HDFS for storing NetCDF and GRIB files

Postgres for storing dispersions

Processing components

Scilearn-kit or TensorFlow to host the weather clustering algorithm

Other modules

ECMWF and NOAA data connectors

WPS normalization procedure

WRF downscaling component

DIPCOT atmospheric dispersion model

Weather and dispersion matching

Sextant for visualizing the dispersion layer

D54 ndash v 100

Page

31

64 Deployment

Table 10 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 10 Components needed to deploy the Second SC5 Pilot

Module Task Responsible

HDFS Sextant Postgres BDI dockers made available by WP4 TF UoA NCSR-D

Scikit-learn TensorFlow To be developed in the pilot NCSR-D

DIPCOT To be packaged in the pilot NCSR-D

Weather clustering algorithm To be developed in the pilot NCSR-D

Weather matching To be developed in the pilot NCSR-D

Dispersion matching To be developed in the pilot NCSR-D

ECMWF and NOAA data

connector

To be developed in the pilot NCSR-D

Data visualization UI To be developed in the pilot NCSR-D

Table 10 Components needed to deploy the Second SC5 Pilot

D54 ndash v 100

Page

32

7 Second SC6 Pilot Deployment

71 Use cases

The pilot is carried out by NCSR-D and SWC in the frame of SC6 Europe in a changing world

- inclusive innovative and reflective societies

The pilot demonstrates the following workflow Municipality economic data (ie budget and

budget execution data) are ingested at a regular basis (daily weekly and so on) from a series

of locations in a variety of structures and formats are homogenized so that they can be

compared analyzed and visualized in a comprehensible way The data is exposed to users

via a dashboard that exposes searchdiscovery aggregation analysis correlation and

visualization functionalities over structured data The results of the data analysis will be stored

in the infrastructure to avoid carrying out the same processing multiple times

The second cycle of the pilot will extend the first pilot by incorporating different formats by

developing a modular parsing library

The following datasets are involved

Budget execution data of Municipality of Athens

Budget execution data of Municipality of Thessaloniki

Budget execution data of Municipality of Barcelona

The current datasets involved are exposed either as an API or as CSV XML files

Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies

Statistical data will be described in the RDF DataCube12 vocabulary

The following processing is carried out

Data ingestion and homogenization

Aggregation analysis correlation over scientific data

The following outputs are made available for visualization or further processing

Structured information extracted from budget datasets exposed as a SPARQL endpoint

Metadata for dataset searching and discovery

9 Cf httpsjoinupeceuropaeuassetdcat_application_profiledescription 10 Cf httpwwwomgorgspecEDMC-FIBOFND10Beta1indexhtm 11 Cf httpwwwomgorghot-topicsfinancehtm 12 Cf httpswwww3orgTR2014REC-vocab-data-cube-20140116

D54 ndash v 100

Page

33

Aggregation and analysis

72 Requirements

Table 11 lists the ingestion storage processing and output requirements set by this pilot

Table 11 Requirements of the Second SC6 Pilot

Requirement Comment

R1 In case of failures during data

processing users should be able to

recover data produced earlier in the

workflow along with their lineage and

associated metadata

Processing modules should periodically

store intermediate results When starting

up processing modules should check at the

metadata registry if intermediate results are

available

R2 Transform budget data into a

homogenized format using various

parsers

Parsers will be developed for the pilot

taking into account R1

R3 Expose data and metadata through a

SPARQL endpoint

The triple store should be accessed via a

SPARQL endpoint

R4 Intuitive easy-to-use interface for

searching and selecting relevant data

sources The use of the user interface

should be documented so that users

can ease into using it with as little

effort as possible

The GraphSearch UI will be used to create

visualizations from SPARQL queries

Table 11 Requirements of the Second SC6 Pilot

D54 ndash v 100

Page

34

Figure 6 Architecture of the Second SC6 Pilot

Figure 6 Architecture of the Second SC6 Pilot

73 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing ingested datasets

4store for storing homogenized statistical data and dataset metadata

Processing infrastructures

Metadata extraction Spark is used to extract RDF data and metadata from budget

data These tools will react on Kafka messages

PoolParty A SKOS Thesaurus13 will be used to consolidatetranslate (linkmap) the

terms in the ingested documents (eg bio terms locations and other named entities)

For this step the SWC PoolParty Semantic Suite14 will be used as an external service

13 Please cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 14 Please cf httpwwwpoolpartybiz

D54 ndash v 100

Page

35

PoolParty is accessible from the BDE components via an HTTP API The connection

between Spark and PoolParty has been implemented in the first pilot cycle Additional

enrichment of the dataset will be explored eg via linking to DBpedia or other LOD

sources

Data analysis that will be performed on demand by pre-defined queries in the

dashboard

Other modules

Flume for dataset ingestion For every source that will be ingested into the system there

will be a flume agent responsible for data ingestion and basic modificationunification

Kafka as soon as a new record is available a Kafka message will be produced One

kafka consumer stores raw data into HDFS

A set of pre-defined SPARQL queries that carry out analytical aggregations important

comparisons and or other analysis of the data

GUI that provide functionality for (a) metadata searching to discover datasets data and

publications (b) linked data browsing (ie dereferencing entity descriptions in RDF) in

the form of a visual dashboard realised in d3js15

GraphSearch as the user interface

74 Deployment

Table 12 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 12 Components needed to deploy the Second SC6 Pilot

Module Task Responsible

Spark over HDFS 4store

Flume Kafka

BDI dockers made available by WP4 FH TF InfAI

NCSR-D SWC

Data storage schema To be extended for the pilot SWC

Metadata extraction Parsers for different data sources will

be developed for the pilot

SWC

15 Cf httpsd3jsorg

D54 ndash v 100

Page

36

GraphSearch GUI To be configured for the pilot SWC

Table 12 Components needed to deploy the Second SC6 Pilot

8 Second SC7 Pilot Deployment

81 Use cases

The pilot is carried out by SatCen UoA and NCSR-D in the frame of SC7 Secure societies ndash

Protecting freedom and security of Europe and its citizens

The pilot demonstrates the following workflows

1 Event detection workflow News sites and social media are monitored and processed

in order to extract and localize information about events Events are categorized and

the information from them is extracted the end-user is notified about the area interested

by the news and can visualize the events information together with the changes

detected by the other workflow (if activated)

2 Change detection workflow The end user selects a relevant Area of Interest With

respect to the selected dates two satellite images (earliest and latest) of these areas

are downloaded from ESA Sentinels Scientific Data Hub and processed in order to

detect changes The end-user is notified about detected changes and can view the

images and event information about this area

The second cycle of the SC7 pilot will extend the functionality and improve the performance of

the first cycle of the pilot (cf D52 Section 8)

Apart from the datasets used in the first cycle of the pilot this cycle will also use the keyword-

based Twitter API to retrieve tweets based on pre-defined keywords To further support the

keyword-based search the second cycle of the pilot will also include a full-text indexing engine

The following outputs are made available for visualization or further processing

Relevant news related to specific keywords together with the corresponding Area of

Interested

Detected changes

Moreover the event detection workflow will be extended in order to automatically activate the

change detection workflow These changes are depicted in the updated architecture diagram

in Figure 7

D54 ndash v 100

Page

37

82 Requirements

Table 13 lists the ingestion storage processing and output requirements set by the second

cycle of the pilot Since the present pilot cycle is an extension of the first pilot the requirements

of the first pilot also apply Table 13 lists only the new requirements

Table 13 Requirements of the Second SC7 Pilot

Requirement Comment

R1 Monitor keyword-based text services

(Twitter) Text is retrieved and stored

together with provenance and any

metadata provided by the service

(notably location)

The NOMAD data connectors to Twitter

and Reuters will be adapted to access the

keyword search API of Twitter and store to

Cassandra

R2 Regularly execute event detection

using Spark over the most recent text

batch

Event detection is part of the ingestion

process and adds annotations to the text

data not part of the distributed processing

R3 Improve the speed of the change

detection workflow

Optimize the scalability of the operators

developed in Apache Spark for the change

detection workflow

R4 Extend change detection workflow to

improve accuracy

Fundamental SNAP operators (eg Subset

and Terrain Correction) for Sentinel 1 will be

adapted to Apache Spark

R5 Areas of Interest are automatically

defined by event detection

The Sentinel data connector is

parametrized from the event detection

module with a GIS shape

R6 End-user interface is based on Sextant Improvement of Sextant functionalities to

improve the user experience

D54 ndash v 100

Page

38

R7 Users must be authenticated and

authorized to access the pilot data

Sextant will be extended in order to support

authentication and authorization

Table 13 Requirements of the Second SC7 Pilot

Figure 7 Architecture of the Second SC7 Pilot

Figure 7 Architecture of the Second SC7 Pilot

83 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing satellite images

Cassandra for storing news and tweets content and metadata

Lucene for storing GADM dataset ie the administrative areas together with their geo-

locations

D54 ndash v 100

Page

39

Strabon for storing geo-locations of detected changes and location metadata about

news and tweets

Processing infrastructures

Spark will be made available for improving the change detection module and

developing the event detection module

Data integration

Semagrow will federate Strabon and Cassandra to provide the user interface with

homogeneous access to both data stores

Other modules

Twitter data connector

Reuters RSS feed reader

The Sentinel Data Aggregator receives as input the set of areas of interest and submits

a suitable query to the Sentinels Scientific Data Hub

Sextant as the user interface

84 Deployment

Table 14 lists the components provided to the pilot as part of BDI16 and components that will

be developed within WP6 in the context of executing the pilot

Table 14 Components needed to deploy the Second SC7 Pilot

Module Task Responsible

Big Data Integrator

HDFSHadoop Cassandra

Spark Semagrow Strabon

SOLR

BDI dockers made available by WP4 FH TF InfAI

NCSR-D UoA

SwC

Cassandra and Strabon

stores

The schema needs to be altered to

support tweets by keyword

NCSR-D and

UoA

Change detection module Spark code to be developed for UoA

16 Cf httpsgithubcombig-data-europeREADMEwikiComponents

D54 ndash v 100

Page

40

extending and improving the change

detection algorithm

Event Detection module Spark code to be developed to scale

the event detection algorithm

NCSR-D

Twitter data connector To be extended to access the keyword

search Twitter API

NCSR-D

User interface To be enhanced for the pilot UoA

Table 14 Components needed to deploy the Second SC7 Pilot

D54 ndash v 100

Page

41

9 Conclusions This report analysed the pilot requirements and specifies the components of the the generic

Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting

round The relevant work in this task is to ensure that the components are within the scope

of what is prepared in WP4 and that they interoperate and can be used in the same

application

All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure

and provided to the piloting partners as a basis for their piloting applications which will be

developed in WP6 As a result of this preliminary testing and the interaction between the

technical partners and the piloting partners some of the original pilot descriptions have

been refined and fully specified and their usage of BDI components has been clarified This

ensures that the pilot descriptions are consistent with the first public release of the BDI

platform (D42) and can be reproduced by interested third parties

Work in this task (Task 52) will proceed as follows

During the second pilot deployment phase work in this task will follow and document

development of the individual components and test their integration into the platform

During the third pilot deployment phase work in this task will prepare the next version

of this document regarding the BDI instances needed for the third piloting round

Page 26: Data Technologies · consulting about the platform. This task includes two phases: the design and the deployment phase. The design phase involves the following. Review the pilot descriptions

D54 ndash v 100

Page

26

Figure 4 Architecture of the Second SC4 Pilot

Figure 4 Architecture of the Second SC4 Pilot

53 Architecture

The architecture of the pilot has been designed taking into consideration the data sources

mostly streams the processing steps needed and the information that needs to be computed

The pilot will ingest data from a near real-time FCD data stream from cabs and from historical

FCD data The FCD data needs to be preprocessed for map matching before being used for

classificationprediction

Apache Kafka will be used to distribute the computations as it provides a scalable fault

tolerant messaging system The processing of the data streams will be performed within

temporal windows Apache Flink will be used for the map matching algorithm in the same

manner as in the first cycle of the pilot Apache Spark or Tensorflow will be considered as a

platform to implement the traffic forecasting algorithm

The algorithms used for the map matching and classification will be provided using R as

it provides a good support for machine learning algorithms and because it is commonly used

D54 ndash v 100

Page

27

and well known by researchers at CERTH In order to use the R packages in a Flink application

developed in Java the pilot will connect to R server (via Rserve) Recurrent Neural Networks

will be used for the traffic forecasting module

The traffic conditions and prediction computation will be stored in a scalable fault tolerant

database such as Elasticsearch The storage system must support spatial and temporal

indexing

54 Deployment

Table 8 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 8 Components needed to deploy Second SC4 Pilot

Module Task Responsible

PostGIS Elasticsearch

Kafka Flink Spark

TensorFlow

BDI dockers made available by WP4 NCSR-D SWC

TF FhG

A Kafka producer for FCD

data stream (source URL)

and historical data (source

file system)

Develop a Kafka producer to collect

the FCD data as a stream from web

services and from the file system for

the historical data sets and send them

to a Kafka topic

FhG

Kafka brokers Install Kafka to provide a message

broker and the topics

SWC

A Spark application for traffic

forecasting and model

training

Develop a Spark application that

consumes FCD matched data from a

Kafka topic The application will train a

prediction model and write the traffic

predictions to ElasticSearch

FhG

D54 ndash v 100

Page

28

A Kafka consumer for storing

analysis results

Develop a Kafka consumer that stores

the result of the Traffic Classification

and prediction module

FhG

Table 8 Components needed to deploy the Second SC4 Pilot

6 Second SC5 Pilot Deployment

61 Use cases

The pilot is carried out by NCSR-D in the frame of SC5 Climate Action Environment Resource

Efficiency and Raw Materials

The pilot demonstrates the following workflow A (potentially hazardous) substance is released

in the atmosphere that results to increased readings in one or more monitoring stations The

user accesses a user interface provided by the pilot to define the locations of the monitoring

stations as well as a timeseries of the measured values (eg gamma dose rate) The platform

initiates

a weather matching algorithm that is a search for similarity of the current weather and

the pre-computed weather patterns as well as

a dispersion matching algorithm that is a search for similarity of the current substance

dispersion patterns with the precomputed ones

The weather patterns have been extracted in a pre-processing step by clustering weather

conditions recorded in the past while the substance dispersion patterns have been

precomputed by simulating different scenarios of substance release and weather conditions

The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request

The following datasets are involved

NetCDF files from the European Centre for Medium range Weather Forecasting

(ECMWF7)

GRIB files from National Oceanic and Atmospheric Administration (NOAA8)

The following processing will be carried out

The weather clustering algorithm that creates clusters of similar weather conditions

implemented using the BDI platform (see Section 63)

7 httpappsecmwfintdatasets 8 httpswwwncdcnoaagovdata-accessmodel-datamodel-datasetsglobal-forcast-system-gfs

D54 ndash v 100

Page

29

The WRF downscaling that takes as input a low resolution weather and creates a high

resolution weather

The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model

computes dispersion patterns given predominant weather conditions

The following outputs are made available for visualization or further processing

The dispersions produced by DIPCOT

The weather clusters produced by the weather clustering algorithm

62 Requirements

Table 9 lists the ingestion storage processing and output requirements set by this pilot

Table 9 Requirements of Second SC5 Pilot

Requirement Comment

R1 Provide a means of downloading

currentevaluation weather from

ECMWF or alternative services

Data connectorinterface needs to be developed

R2 ECMWF and NOAA datasets are

compatible with the WRF and

DIPCOT naming conventions

A preprocessing WPS normalization step will

perform the necessary transformations and

variable renamings needs to ensure compatibility

R3 Retrieve NetCDF files from HDFS

as input to the weather clustering

algorithm

R4 Dispersion matching will filter on

dispersion values

Relational database will provide indexes on

dispersion values for efficient dispersion search

R5 Dispersion visualization Weather and dispersion matching must produce

output compatible with Sextantrsquos input or Sextant

must be modified to support new input

Table 9 Requirements of the Second SC5 Pilot

D54 ndash v 100

Page

30

Figure 5 Architecture of the Second SC5 Pilot

Figure 5 Architecture of the Second SC5 Pilot

63 Architecture

To satisfy the requirements described above the following components will be deployed

Storage infrastructure

HDFS for storing NetCDF and GRIB files

Postgres for storing dispersions

Processing components

Scilearn-kit or TensorFlow to host the weather clustering algorithm

Other modules

ECMWF and NOAA data connectors

WPS normalization procedure

WRF downscaling component

DIPCOT atmospheric dispersion model

Weather and dispersion matching

Sextant for visualizing the dispersion layer

D54 ndash v 100

Page

31

64 Deployment

Table 10 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 10 Components needed to deploy the Second SC5 Pilot

Module Task Responsible

HDFS Sextant Postgres BDI dockers made available by WP4 TF UoA NCSR-D

Scikit-learn TensorFlow To be developed in the pilot NCSR-D

DIPCOT To be packaged in the pilot NCSR-D

Weather clustering algorithm To be developed in the pilot NCSR-D

Weather matching To be developed in the pilot NCSR-D

Dispersion matching To be developed in the pilot NCSR-D

ECMWF and NOAA data

connector

To be developed in the pilot NCSR-D

Data visualization UI To be developed in the pilot NCSR-D

Table 10 Components needed to deploy the Second SC5 Pilot

D54 ndash v 100

Page

32

7 Second SC6 Pilot Deployment

71 Use cases

The pilot is carried out by NCSR-D and SWC in the frame of SC6 Europe in a changing world

- inclusive innovative and reflective societies

The pilot demonstrates the following workflow Municipality economic data (ie budget and

budget execution data) are ingested at a regular basis (daily weekly and so on) from a series

of locations in a variety of structures and formats are homogenized so that they can be

compared analyzed and visualized in a comprehensible way The data is exposed to users

via a dashboard that exposes searchdiscovery aggregation analysis correlation and

visualization functionalities over structured data The results of the data analysis will be stored

in the infrastructure to avoid carrying out the same processing multiple times

The second cycle of the pilot will extend the first pilot by incorporating different formats by

developing a modular parsing library

The following datasets are involved

Budget execution data of Municipality of Athens

Budget execution data of Municipality of Thessaloniki

Budget execution data of Municipality of Barcelona

The current datasets involved are exposed either as an API or as CSV XML files

Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies

Statistical data will be described in the RDF DataCube12 vocabulary

The following processing is carried out

Data ingestion and homogenization

Aggregation analysis correlation over scientific data

The following outputs are made available for visualization or further processing

Structured information extracted from budget datasets exposed as a SPARQL endpoint

Metadata for dataset searching and discovery

9 Cf httpsjoinupeceuropaeuassetdcat_application_profiledescription 10 Cf httpwwwomgorgspecEDMC-FIBOFND10Beta1indexhtm 11 Cf httpwwwomgorghot-topicsfinancehtm 12 Cf httpswwww3orgTR2014REC-vocab-data-cube-20140116

D54 ndash v 100

Page

33

Aggregation and analysis

72 Requirements

Table 11 lists the ingestion storage processing and output requirements set by this pilot

Table 11 Requirements of the Second SC6 Pilot

Requirement Comment

R1 In case of failures during data

processing users should be able to

recover data produced earlier in the

workflow along with their lineage and

associated metadata

Processing modules should periodically

store intermediate results When starting

up processing modules should check at the

metadata registry if intermediate results are

available

R2 Transform budget data into a

homogenized format using various

parsers

Parsers will be developed for the pilot

taking into account R1

R3 Expose data and metadata through a

SPARQL endpoint

The triple store should be accessed via a

SPARQL endpoint

R4 Intuitive easy-to-use interface for

searching and selecting relevant data

sources The use of the user interface

should be documented so that users

can ease into using it with as little

effort as possible

The GraphSearch UI will be used to create

visualizations from SPARQL queries

Table 11 Requirements of the Second SC6 Pilot

D54 ndash v 100

Page

34

Figure 6 Architecture of the Second SC6 Pilot

Figure 6 Architecture of the Second SC6 Pilot

73 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing ingested datasets

4store for storing homogenized statistical data and dataset metadata

Processing infrastructures

Metadata extraction Spark is used to extract RDF data and metadata from budget

data These tools will react on Kafka messages

PoolParty A SKOS Thesaurus13 will be used to consolidatetranslate (linkmap) the

terms in the ingested documents (eg bio terms locations and other named entities)

For this step the SWC PoolParty Semantic Suite14 will be used as an external service

13 Please cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 14 Please cf httpwwwpoolpartybiz

D54 ndash v 100

Page

35

PoolParty is accessible from the BDE components via an HTTP API The connection

between Spark and PoolParty has been implemented in the first pilot cycle Additional

enrichment of the dataset will be explored eg via linking to DBpedia or other LOD

sources

Data analysis that will be performed on demand by pre-defined queries in the

dashboard

Other modules

Flume for dataset ingestion For every source that will be ingested into the system there

will be a flume agent responsible for data ingestion and basic modificationunification

Kafka as soon as a new record is available a Kafka message will be produced One

kafka consumer stores raw data into HDFS

A set of pre-defined SPARQL queries that carry out analytical aggregations important

comparisons and or other analysis of the data

GUI that provide functionality for (a) metadata searching to discover datasets data and

publications (b) linked data browsing (ie dereferencing entity descriptions in RDF) in

the form of a visual dashboard realised in d3js15

GraphSearch as the user interface

74 Deployment

Table 12 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 12 Components needed to deploy the Second SC6 Pilot

Module Task Responsible

Spark over HDFS 4store

Flume Kafka

BDI dockers made available by WP4 FH TF InfAI

NCSR-D SWC

Data storage schema To be extended for the pilot SWC

Metadata extraction Parsers for different data sources will

be developed for the pilot

SWC

15 Cf httpsd3jsorg

D54 ndash v 100

Page

36

GraphSearch GUI To be configured for the pilot SWC

Table 12 Components needed to deploy the Second SC6 Pilot

8 Second SC7 Pilot Deployment

81 Use cases

The pilot is carried out by SatCen UoA and NCSR-D in the frame of SC7 Secure societies ndash

Protecting freedom and security of Europe and its citizens

The pilot demonstrates the following workflows

1 Event detection workflow News sites and social media are monitored and processed

in order to extract and localize information about events Events are categorized and

the information from them is extracted the end-user is notified about the area interested

by the news and can visualize the events information together with the changes

detected by the other workflow (if activated)

2 Change detection workflow The end user selects a relevant Area of Interest With

respect to the selected dates two satellite images (earliest and latest) of these areas

are downloaded from ESA Sentinels Scientific Data Hub and processed in order to

detect changes The end-user is notified about detected changes and can view the

images and event information about this area

The second cycle of the SC7 pilot will extend the functionality and improve the performance of

the first cycle of the pilot (cf D52 Section 8)

Apart from the datasets used in the first cycle of the pilot this cycle will also use the keyword-

based Twitter API to retrieve tweets based on pre-defined keywords To further support the

keyword-based search the second cycle of the pilot will also include a full-text indexing engine

The following outputs are made available for visualization or further processing

Relevant news related to specific keywords together with the corresponding Area of

Interested

Detected changes

Moreover the event detection workflow will be extended in order to automatically activate the

change detection workflow These changes are depicted in the updated architecture diagram

in Figure 7

D54 ndash v 100

Page

37

82 Requirements

Table 13 lists the ingestion storage processing and output requirements set by the second

cycle of the pilot Since the present pilot cycle is an extension of the first pilot the requirements

of the first pilot also apply Table 13 lists only the new requirements

Table 13 Requirements of the Second SC7 Pilot

Requirement Comment

R1 Monitor keyword-based text services

(Twitter) Text is retrieved and stored

together with provenance and any

metadata provided by the service

(notably location)

The NOMAD data connectors to Twitter

and Reuters will be adapted to access the

keyword search API of Twitter and store to

Cassandra

R2 Regularly execute event detection

using Spark over the most recent text

batch

Event detection is part of the ingestion

process and adds annotations to the text

data not part of the distributed processing

R3 Improve the speed of the change

detection workflow

Optimize the scalability of the operators

developed in Apache Spark for the change

detection workflow

R4 Extend change detection workflow to

improve accuracy

Fundamental SNAP operators (eg Subset

and Terrain Correction) for Sentinel 1 will be

adapted to Apache Spark

R5 Areas of Interest are automatically

defined by event detection

The Sentinel data connector is

parametrized from the event detection

module with a GIS shape

R6 End-user interface is based on Sextant Improvement of Sextant functionalities to

improve the user experience

D54 ndash v 100

Page

38

R7 Users must be authenticated and

authorized to access the pilot data

Sextant will be extended in order to support

authentication and authorization

Table 13 Requirements of the Second SC7 Pilot

Figure 7 Architecture of the Second SC7 Pilot

Figure 7 Architecture of the Second SC7 Pilot

83 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing satellite images

Cassandra for storing news and tweets content and metadata

Lucene for storing GADM dataset ie the administrative areas together with their geo-

locations

D54 ndash v 100

Page

39

Strabon for storing geo-locations of detected changes and location metadata about

news and tweets

Processing infrastructures

Spark will be made available for improving the change detection module and

developing the event detection module

Data integration

Semagrow will federate Strabon and Cassandra to provide the user interface with

homogeneous access to both data stores

Other modules

Twitter data connector

Reuters RSS feed reader

The Sentinel Data Aggregator receives as input the set of areas of interest and submits

a suitable query to the Sentinels Scientific Data Hub

Sextant as the user interface

84 Deployment

Table 14 lists the components provided to the pilot as part of BDI16 and components that will

be developed within WP6 in the context of executing the pilot

Table 14 Components needed to deploy the Second SC7 Pilot

Module Task Responsible

Big Data Integrator

HDFSHadoop Cassandra

Spark Semagrow Strabon

SOLR

BDI dockers made available by WP4 FH TF InfAI

NCSR-D UoA

SwC

Cassandra and Strabon

stores

The schema needs to be altered to

support tweets by keyword

NCSR-D and

UoA

Change detection module Spark code to be developed for UoA

16 Cf httpsgithubcombig-data-europeREADMEwikiComponents

D54 ndash v 100

Page

40

extending and improving the change

detection algorithm

Event Detection module Spark code to be developed to scale

the event detection algorithm

NCSR-D

Twitter data connector To be extended to access the keyword

search Twitter API

NCSR-D

User interface To be enhanced for the pilot UoA

Table 14 Components needed to deploy the Second SC7 Pilot

D54 ndash v 100

Page

41

9 Conclusions This report analysed the pilot requirements and specifies the components of the the generic

Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting

round The relevant work in this task is to ensure that the components are within the scope

of what is prepared in WP4 and that they interoperate and can be used in the same

application

All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure

and provided to the piloting partners as a basis for their piloting applications which will be

developed in WP6 As a result of this preliminary testing and the interaction between the

technical partners and the piloting partners some of the original pilot descriptions have

been refined and fully specified and their usage of BDI components has been clarified This

ensures that the pilot descriptions are consistent with the first public release of the BDI

platform (D42) and can be reproduced by interested third parties

Work in this task (Task 52) will proceed as follows

During the second pilot deployment phase work in this task will follow and document

development of the individual components and test their integration into the platform

During the third pilot deployment phase work in this task will prepare the next version

of this document regarding the BDI instances needed for the third piloting round

Page 27: Data Technologies · consulting about the platform. This task includes two phases: the design and the deployment phase. The design phase involves the following. Review the pilot descriptions

D54 ndash v 100

Page

27

and well known by researchers at CERTH In order to use the R packages in a Flink application

developed in Java the pilot will connect to R server (via Rserve) Recurrent Neural Networks

will be used for the traffic forecasting module

The traffic conditions and prediction computation will be stored in a scalable fault tolerant

database such as Elasticsearch The storage system must support spatial and temporal

indexing

54 Deployment

Table 8 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 8 Components needed to deploy Second SC4 Pilot

Module Task Responsible

PostGIS Elasticsearch

Kafka Flink Spark

TensorFlow

BDI dockers made available by WP4 NCSR-D SWC

TF FhG

A Kafka producer for FCD

data stream (source URL)

and historical data (source

file system)

Develop a Kafka producer to collect

the FCD data as a stream from web

services and from the file system for

the historical data sets and send them

to a Kafka topic

FhG

Kafka brokers Install Kafka to provide a message

broker and the topics

SWC

A Spark application for traffic

forecasting and model

training

Develop a Spark application that

consumes FCD matched data from a

Kafka topic The application will train a

prediction model and write the traffic

predictions to ElasticSearch

FhG

D54 ndash v 100

Page

28

A Kafka consumer for storing

analysis results

Develop a Kafka consumer that stores

the result of the Traffic Classification

and prediction module

FhG

Table 8 Components needed to deploy the Second SC4 Pilot

6 Second SC5 Pilot Deployment

61 Use cases

The pilot is carried out by NCSR-D in the frame of SC5 Climate Action Environment Resource

Efficiency and Raw Materials

The pilot demonstrates the following workflow A (potentially hazardous) substance is released

in the atmosphere that results to increased readings in one or more monitoring stations The

user accesses a user interface provided by the pilot to define the locations of the monitoring

stations as well as a timeseries of the measured values (eg gamma dose rate) The platform

initiates

a weather matching algorithm that is a search for similarity of the current weather and

the pre-computed weather patterns as well as

a dispersion matching algorithm that is a search for similarity of the current substance

dispersion patterns with the precomputed ones

The weather patterns have been extracted in a pre-processing step by clustering weather

conditions recorded in the past while the substance dispersion patterns have been

precomputed by simulating different scenarios of substance release and weather conditions

The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request

The following datasets are involved

NetCDF files from the European Centre for Medium range Weather Forecasting

(ECMWF7)

GRIB files from National Oceanic and Atmospheric Administration (NOAA8)

The following processing will be carried out

The weather clustering algorithm that creates clusters of similar weather conditions

implemented using the BDI platform (see Section 63)

7 httpappsecmwfintdatasets 8 httpswwwncdcnoaagovdata-accessmodel-datamodel-datasetsglobal-forcast-system-gfs

D54 ndash v 100

Page

29

The WRF downscaling that takes as input a low resolution weather and creates a high

resolution weather

The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model

computes dispersion patterns given predominant weather conditions

The following outputs are made available for visualization or further processing

The dispersions produced by DIPCOT

The weather clusters produced by the weather clustering algorithm

62 Requirements

Table 9 lists the ingestion storage processing and output requirements set by this pilot

Table 9 Requirements of Second SC5 Pilot

Requirement Comment

R1 Provide a means of downloading

currentevaluation weather from

ECMWF or alternative services

Data connectorinterface needs to be developed

R2 ECMWF and NOAA datasets are

compatible with the WRF and

DIPCOT naming conventions

A preprocessing WPS normalization step will

perform the necessary transformations and

variable renamings needs to ensure compatibility

R3 Retrieve NetCDF files from HDFS

as input to the weather clustering

algorithm

R4 Dispersion matching will filter on

dispersion values

Relational database will provide indexes on

dispersion values for efficient dispersion search

R5 Dispersion visualization Weather and dispersion matching must produce

output compatible with Sextantrsquos input or Sextant

must be modified to support new input

Table 9 Requirements of the Second SC5 Pilot

D54 ndash v 100

Page

30

Figure 5 Architecture of the Second SC5 Pilot

Figure 5 Architecture of the Second SC5 Pilot

63 Architecture

To satisfy the requirements described above the following components will be deployed

Storage infrastructure

HDFS for storing NetCDF and GRIB files

Postgres for storing dispersions

Processing components

Scilearn-kit or TensorFlow to host the weather clustering algorithm

Other modules

ECMWF and NOAA data connectors

WPS normalization procedure

WRF downscaling component

DIPCOT atmospheric dispersion model

Weather and dispersion matching

Sextant for visualizing the dispersion layer

D54 ndash v 100

Page

31

64 Deployment

Table 10 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 10 Components needed to deploy the Second SC5 Pilot

Module Task Responsible

HDFS Sextant Postgres BDI dockers made available by WP4 TF UoA NCSR-D

Scikit-learn TensorFlow To be developed in the pilot NCSR-D

DIPCOT To be packaged in the pilot NCSR-D

Weather clustering algorithm To be developed in the pilot NCSR-D

Weather matching To be developed in the pilot NCSR-D

Dispersion matching To be developed in the pilot NCSR-D

ECMWF and NOAA data

connector

To be developed in the pilot NCSR-D

Data visualization UI To be developed in the pilot NCSR-D

Table 10 Components needed to deploy the Second SC5 Pilot

D54 ndash v 100

Page

32

7 Second SC6 Pilot Deployment

71 Use cases

The pilot is carried out by NCSR-D and SWC in the frame of SC6 Europe in a changing world

- inclusive innovative and reflective societies

The pilot demonstrates the following workflow Municipality economic data (ie budget and

budget execution data) are ingested at a regular basis (daily weekly and so on) from a series

of locations in a variety of structures and formats are homogenized so that they can be

compared analyzed and visualized in a comprehensible way The data is exposed to users

via a dashboard that exposes searchdiscovery aggregation analysis correlation and

visualization functionalities over structured data The results of the data analysis will be stored

in the infrastructure to avoid carrying out the same processing multiple times

The second cycle of the pilot will extend the first pilot by incorporating different formats by

developing a modular parsing library

The following datasets are involved

Budget execution data of Municipality of Athens

Budget execution data of Municipality of Thessaloniki

Budget execution data of Municipality of Barcelona

The current datasets involved are exposed either as an API or as CSV XML files

Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies

Statistical data will be described in the RDF DataCube12 vocabulary

The following processing is carried out

Data ingestion and homogenization

Aggregation analysis correlation over scientific data

The following outputs are made available for visualization or further processing

Structured information extracted from budget datasets exposed as a SPARQL endpoint

Metadata for dataset searching and discovery

9 Cf httpsjoinupeceuropaeuassetdcat_application_profiledescription 10 Cf httpwwwomgorgspecEDMC-FIBOFND10Beta1indexhtm 11 Cf httpwwwomgorghot-topicsfinancehtm 12 Cf httpswwww3orgTR2014REC-vocab-data-cube-20140116

D54 ndash v 100

Page

33

Aggregation and analysis

72 Requirements

Table 11 lists the ingestion storage processing and output requirements set by this pilot

Table 11 Requirements of the Second SC6 Pilot

Requirement Comment

R1 In case of failures during data

processing users should be able to

recover data produced earlier in the

workflow along with their lineage and

associated metadata

Processing modules should periodically

store intermediate results When starting

up processing modules should check at the

metadata registry if intermediate results are

available

R2 Transform budget data into a

homogenized format using various

parsers

Parsers will be developed for the pilot

taking into account R1

R3 Expose data and metadata through a

SPARQL endpoint

The triple store should be accessed via a

SPARQL endpoint

R4 Intuitive easy-to-use interface for

searching and selecting relevant data

sources The use of the user interface

should be documented so that users

can ease into using it with as little

effort as possible

The GraphSearch UI will be used to create

visualizations from SPARQL queries

Table 11 Requirements of the Second SC6 Pilot

D54 ndash v 100

Page

34

Figure 6 Architecture of the Second SC6 Pilot

Figure 6 Architecture of the Second SC6 Pilot

73 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing ingested datasets

4store for storing homogenized statistical data and dataset metadata

Processing infrastructures

Metadata extraction Spark is used to extract RDF data and metadata from budget

data These tools will react on Kafka messages

PoolParty A SKOS Thesaurus13 will be used to consolidatetranslate (linkmap) the

terms in the ingested documents (eg bio terms locations and other named entities)

For this step the SWC PoolParty Semantic Suite14 will be used as an external service

13 Please cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 14 Please cf httpwwwpoolpartybiz

D54 ndash v 100

Page

35

PoolParty is accessible from the BDE components via an HTTP API The connection

between Spark and PoolParty has been implemented in the first pilot cycle Additional

enrichment of the dataset will be explored eg via linking to DBpedia or other LOD

sources

Data analysis that will be performed on demand by pre-defined queries in the

dashboard

Other modules

Flume for dataset ingestion For every source that will be ingested into the system there

will be a flume agent responsible for data ingestion and basic modificationunification

Kafka as soon as a new record is available a Kafka message will be produced One

kafka consumer stores raw data into HDFS

A set of pre-defined SPARQL queries that carry out analytical aggregations important

comparisons and or other analysis of the data

GUI that provide functionality for (a) metadata searching to discover datasets data and

publications (b) linked data browsing (ie dereferencing entity descriptions in RDF) in

the form of a visual dashboard realised in d3js15

GraphSearch as the user interface

74 Deployment

Table 12 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 12 Components needed to deploy the Second SC6 Pilot

Module Task Responsible

Spark over HDFS 4store

Flume Kafka

BDI dockers made available by WP4 FH TF InfAI

NCSR-D SWC

Data storage schema To be extended for the pilot SWC

Metadata extraction Parsers for different data sources will

be developed for the pilot

SWC

15 Cf httpsd3jsorg

D54 ndash v 100

Page

36

GraphSearch GUI To be configured for the pilot SWC

Table 12 Components needed to deploy the Second SC6 Pilot

8 Second SC7 Pilot Deployment

81 Use cases

The pilot is carried out by SatCen UoA and NCSR-D in the frame of SC7 Secure societies ndash

Protecting freedom and security of Europe and its citizens

The pilot demonstrates the following workflows

1 Event detection workflow News sites and social media are monitored and processed

in order to extract and localize information about events Events are categorized and

the information from them is extracted the end-user is notified about the area interested

by the news and can visualize the events information together with the changes

detected by the other workflow (if activated)

2 Change detection workflow The end user selects a relevant Area of Interest With

respect to the selected dates two satellite images (earliest and latest) of these areas

are downloaded from ESA Sentinels Scientific Data Hub and processed in order to

detect changes The end-user is notified about detected changes and can view the

images and event information about this area

The second cycle of the SC7 pilot will extend the functionality and improve the performance of

the first cycle of the pilot (cf D52 Section 8)

Apart from the datasets used in the first cycle of the pilot this cycle will also use the keyword-

based Twitter API to retrieve tweets based on pre-defined keywords To further support the

keyword-based search the second cycle of the pilot will also include a full-text indexing engine

The following outputs are made available for visualization or further processing

Relevant news related to specific keywords together with the corresponding Area of

Interested

Detected changes

Moreover the event detection workflow will be extended in order to automatically activate the

change detection workflow These changes are depicted in the updated architecture diagram

in Figure 7

D54 ndash v 100

Page

37

82 Requirements

Table 13 lists the ingestion storage processing and output requirements set by the second

cycle of the pilot Since the present pilot cycle is an extension of the first pilot the requirements

of the first pilot also apply Table 13 lists only the new requirements

Table 13 Requirements of the Second SC7 Pilot

Requirement Comment

R1 Monitor keyword-based text services

(Twitter) Text is retrieved and stored

together with provenance and any

metadata provided by the service

(notably location)

The NOMAD data connectors to Twitter

and Reuters will be adapted to access the

keyword search API of Twitter and store to

Cassandra

R2 Regularly execute event detection

using Spark over the most recent text

batch

Event detection is part of the ingestion

process and adds annotations to the text

data not part of the distributed processing

R3 Improve the speed of the change

detection workflow

Optimize the scalability of the operators

developed in Apache Spark for the change

detection workflow

R4 Extend change detection workflow to

improve accuracy

Fundamental SNAP operators (eg Subset

and Terrain Correction) for Sentinel 1 will be

adapted to Apache Spark

R5 Areas of Interest are automatically

defined by event detection

The Sentinel data connector is

parametrized from the event detection

module with a GIS shape

R6 End-user interface is based on Sextant Improvement of Sextant functionalities to

improve the user experience

D54 ndash v 100

Page

38

R7 Users must be authenticated and

authorized to access the pilot data

Sextant will be extended in order to support

authentication and authorization

Table 13 Requirements of the Second SC7 Pilot

Figure 7 Architecture of the Second SC7 Pilot

Figure 7 Architecture of the Second SC7 Pilot

83 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing satellite images

Cassandra for storing news and tweets content and metadata

Lucene for storing GADM dataset ie the administrative areas together with their geo-

locations

D54 ndash v 100

Page

39

Strabon for storing geo-locations of detected changes and location metadata about

news and tweets

Processing infrastructures

Spark will be made available for improving the change detection module and

developing the event detection module

Data integration

Semagrow will federate Strabon and Cassandra to provide the user interface with

homogeneous access to both data stores

Other modules

Twitter data connector

Reuters RSS feed reader

The Sentinel Data Aggregator receives as input the set of areas of interest and submits

a suitable query to the Sentinels Scientific Data Hub

Sextant as the user interface

84 Deployment

Table 14 lists the components provided to the pilot as part of BDI16 and components that will

be developed within WP6 in the context of executing the pilot

Table 14: Components needed to deploy the Second SC7 Pilot

- Big Data Integrator (HDFS/Hadoop, Cassandra, Spark, Semagrow, Strabon, SOLR): BDI dockers made available by WP4. Responsible: FH, TF, InfAI, NCSR-D, UoA, SwC.
- Cassandra and Strabon stores: the schema needs to be altered to support tweets by keyword. Responsible: NCSR-D and UoA.
- Change detection module: Spark code to be developed for extending and improving the change detection algorithm. Responsible: UoA.
- Event detection module: Spark code to be developed to scale the event detection algorithm. Responsible: NCSR-D.
- Twitter data connector: to be extended to access the keyword-search Twitter API. Responsible: NCSR-D.
- User interface: to be enhanced for the pilot. Responsible: UoA.

[16] Cf. https://github.com/big-data-europe/README/wiki/Components
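For the change detection module, the Spark code could distribute the per-scene SNAP pre-processing (e.g. Subset and Terrain Correction, cf. R3 and R4) over the cluster. The sketch below illustrates the idea; the process_scene() wrapper around the SNAP operators and the HDFS paths are placeholders, not the pilot's actual implementation.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sc7-change-detection").getOrCreate()

def process_scene(scene_path):
    """Placeholder: apply SNAP Subset and Terrain Correction to one Sentinel-1 scene."""
    return {"scene": scene_path, "status": "processed"}

# Illustrative HDFS paths to Sentinel-1 scenes downloaded by the Sentinel Data Aggregator.
scene_paths = [
    "hdfs:///sc7/sentinel1/scene_A.zip",
    "hdfs:///sc7/sentinel1/scene_B.zip",
]

# Distribute the per-scene SNAP operators over the Spark workers.
results = spark.sparkContext.parallelize(scene_paths).map(process_scene).collect()
spark.stop()
```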


9 Conclusions

This report analysed the pilot requirements and specified the components of the generic Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting round. The relevant work in this task is to ensure that the components are within the scope of what is prepared in WP4, and that they interoperate and can be used in the same application.

All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure and provided to the piloting partners as a basis for their piloting applications, which will be developed in WP6. As a result of this preliminary testing and of the interaction between the technical partners and the piloting partners, some of the original pilot descriptions have been refined and fully specified, and their usage of BDI components has been clarified. This ensures that the pilot descriptions are consistent with the first public release of the BDI platform (D4.2) and can be reproduced by interested third parties.

Work in this task (Task 5.2) will proceed as follows:
- During the second pilot deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.
- During the third pilot deployment phase, work in this task will prepare the next version of this document, regarding the BDI instances needed for the third piloting round.

Page 31: Data Technologies · consulting about the platform. This task includes two phases: the design and the deployment phase. The design phase involves the following. Review the pilot descriptions

D54 ndash v 100

Page

31

64 Deployment

Table 10 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 10 Components needed to deploy the Second SC5 Pilot

Module Task Responsible

HDFS Sextant Postgres BDI dockers made available by WP4 TF UoA NCSR-D

Scikit-learn TensorFlow To be developed in the pilot NCSR-D

DIPCOT To be packaged in the pilot NCSR-D

Weather clustering algorithm To be developed in the pilot NCSR-D

Weather matching To be developed in the pilot NCSR-D

Dispersion matching To be developed in the pilot NCSR-D

ECMWF and NOAA data

connector

To be developed in the pilot NCSR-D

Data visualization UI To be developed in the pilot NCSR-D

Table 10 Components needed to deploy the Second SC5 Pilot

D54 ndash v 100

Page

32

7 Second SC6 Pilot Deployment

71 Use cases

The pilot is carried out by NCSR-D and SWC in the frame of SC6 Europe in a changing world

- inclusive innovative and reflective societies

The pilot demonstrates the following workflow Municipality economic data (ie budget and

budget execution data) are ingested at a regular basis (daily weekly and so on) from a series

of locations in a variety of structures and formats are homogenized so that they can be

compared analyzed and visualized in a comprehensible way The data is exposed to users

via a dashboard that exposes searchdiscovery aggregation analysis correlation and

visualization functionalities over structured data The results of the data analysis will be stored

in the infrastructure to avoid carrying out the same processing multiple times

The second cycle of the pilot will extend the first pilot by incorporating different formats by

developing a modular parsing library

The following datasets are involved

Budget execution data of Municipality of Athens

Budget execution data of Municipality of Thessaloniki

Budget execution data of Municipality of Barcelona

The current datasets involved are exposed either as an API or as CSV XML files

Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies

Statistical data will be described in the RDF DataCube12 vocabulary

The following processing is carried out

Data ingestion and homogenization

Aggregation analysis correlation over scientific data

The following outputs are made available for visualization or further processing

Structured information extracted from budget datasets exposed as a SPARQL endpoint

Metadata for dataset searching and discovery

9 Cf httpsjoinupeceuropaeuassetdcat_application_profiledescription 10 Cf httpwwwomgorgspecEDMC-FIBOFND10Beta1indexhtm 11 Cf httpwwwomgorghot-topicsfinancehtm 12 Cf httpswwww3orgTR2014REC-vocab-data-cube-20140116

D54 ndash v 100

Page

33

Aggregation and analysis

72 Requirements

Table 11 lists the ingestion storage processing and output requirements set by this pilot

Table 11 Requirements of the Second SC6 Pilot

Requirement Comment

R1 In case of failures during data

processing users should be able to

recover data produced earlier in the

workflow along with their lineage and

associated metadata

Processing modules should periodically

store intermediate results When starting

up processing modules should check at the

metadata registry if intermediate results are

available

R2 Transform budget data into a

homogenized format using various

parsers

Parsers will be developed for the pilot

taking into account R1

R3 Expose data and metadata through a

SPARQL endpoint

The triple store should be accessed via a

SPARQL endpoint

R4 Intuitive easy-to-use interface for

searching and selecting relevant data

sources The use of the user interface

should be documented so that users

can ease into using it with as little

effort as possible

The GraphSearch UI will be used to create

visualizations from SPARQL queries

Table 11 Requirements of the Second SC6 Pilot

D54 ndash v 100

Page

34

Figure 6 Architecture of the Second SC6 Pilot

Figure 6 Architecture of the Second SC6 Pilot

73 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing ingested datasets

4store for storing homogenized statistical data and dataset metadata

Processing infrastructures

Metadata extraction Spark is used to extract RDF data and metadata from budget

data These tools will react on Kafka messages

PoolParty A SKOS Thesaurus13 will be used to consolidatetranslate (linkmap) the

terms in the ingested documents (eg bio terms locations and other named entities)

For this step the SWC PoolParty Semantic Suite14 will be used as an external service

13 Please cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 14 Please cf httpwwwpoolpartybiz

D54 ndash v 100

Page

35

PoolParty is accessible from the BDE components via an HTTP API The connection

between Spark and PoolParty has been implemented in the first pilot cycle Additional

enrichment of the dataset will be explored eg via linking to DBpedia or other LOD

sources

Data analysis that will be performed on demand by pre-defined queries in the

dashboard

Other modules

Flume for dataset ingestion For every source that will be ingested into the system there

will be a flume agent responsible for data ingestion and basic modificationunification

Kafka as soon as a new record is available a Kafka message will be produced One

kafka consumer stores raw data into HDFS

A set of pre-defined SPARQL queries that carry out analytical aggregations important

comparisons and or other analysis of the data

GUI that provide functionality for (a) metadata searching to discover datasets data and

publications (b) linked data browsing (ie dereferencing entity descriptions in RDF) in

the form of a visual dashboard realised in d3js15

GraphSearch as the user interface

74 Deployment

Table 12 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 12 Components needed to deploy the Second SC6 Pilot

Module Task Responsible

Spark over HDFS 4store

Flume Kafka

BDI dockers made available by WP4 FH TF InfAI

NCSR-D SWC

Data storage schema To be extended for the pilot SWC

Metadata extraction Parsers for different data sources will

be developed for the pilot

SWC

15 Cf httpsd3jsorg

D54 ndash v 100

Page

36

GraphSearch GUI To be configured for the pilot SWC

Table 12 Components needed to deploy the Second SC6 Pilot

8 Second SC7 Pilot Deployment

8.1 Use cases

The pilot is carried out by SatCen, UoA and NCSR-D in the frame of SC7, Secure societies – Protecting freedom and security of Europe and its citizens.

The pilot demonstrates the following workflows:

1. Event detection workflow: News sites and social media are monitored and processed in order to extract and localize information about events. Events are categorized and the information from them is extracted; the end-user is notified about the area affected by the news and can visualize the event information together with the changes detected by the other workflow (if activated).

2. Change detection workflow: The end user selects a relevant Area of Interest. With respect to the selected dates, two satellite images (earliest and latest) of these areas are downloaded from the ESA Sentinels Scientific Data Hub and processed in order to detect changes. The end-user is notified about detected changes and can view the images and event information about this area.

The second cycle of the SC7 pilot will extend the functionality and improve the performance of the first cycle of the pilot (cf. D5.2, Section 8).

Apart from the datasets used in the first cycle of the pilot, this cycle will also use the keyword-based Twitter API to retrieve tweets based on pre-defined keywords. To further support the keyword-based search, the second cycle of the pilot will also include a full-text indexing engine.

The following outputs are made available for visualization or further processing:
- Relevant news related to specific keywords, together with the corresponding Area of Interest.
- Detected changes.

Moreover, the event detection workflow will be extended in order to automatically activate the change detection workflow. These changes are depicted in the updated architecture diagram in Figure 7.
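To support the keyword-based search described above, retrieved tweets and news items would be indexed in the full-text engine. A minimal sketch using a SOLR core (listed among the pilot components in Table 14) via the pysolr client is shown below; the core name, host and field names are assumptions made for the example.

    import pysolr

    solr = pysolr.Solr("http://solr:8983/solr/sc7_text", timeout=10)   # assumed host and core

    # Index a retrieved item together with its provenance and optional location metadata.
    solr.add([{
        "id": "tweet-824379112",
        "source": "twitter",
        "text": "Flooding reported near the river port after heavy rainfall",
        "location": "38.0,23.7",
    }])

    # Keyword search backing the pre-defined keyword lists of the pilot.
    for hit in solr.search("text:flooding", rows=10):
        print(hit["id"], hit["text"])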


8.2 Requirements

Table 13 lists the ingestion, storage, processing and output requirements set by the second cycle of the pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 13 lists only the new requirements.

Table 13: Requirements of the Second SC7 Pilot

R1: Monitor keyword-based text services (Twitter). Text is retrieved and stored together with provenance and any metadata provided by the service (notably location).
Comment: The NOMAD data connectors to Twitter and Reuters will be adapted to access the keyword search API of Twitter and store to Cassandra.

R2: Regularly execute event detection using Spark over the most recent text batch.
Comment: Event detection is part of the ingestion process and adds annotations to the text data; it is not part of the distributed processing.

R3: Improve the speed of the change detection workflow.
Comment: Optimize the scalability of the operators developed in Apache Spark for the change detection workflow.

R4: Extend the change detection workflow to improve accuracy.
Comment: Fundamental SNAP operators (e.g. Subset and Terrain Correction) for Sentinel-1 will be adapted to Apache Spark.

R5: Areas of Interest are automatically defined by event detection.
Comment: The Sentinel data connector is parametrized from the event detection module with a GIS shape.

R6: End-user interface is based on Sextant.
Comment: Improvement of Sextant functionalities to improve the user experience.

R7: Users must be authenticated and authorized to access the pilot data.
Comment: Sextant will be extended in order to support authentication and authorization.
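As an illustration of R1, the sketch below retrieves tweets for a pre-defined keyword list through the standard Twitter search API and stores them in Cassandra. This is not the pilot's NOMAD connector: the bearer token, keyspace and table schema are assumptions made for the example.

    import requests
    from cassandra.cluster import Cluster

    BEARER_TOKEN = "..."                                  # hypothetical application token
    KEYWORDS = "flood OR wildfire OR earthquake"          # example pre-defined keywords

    resp = requests.get(
        "https://api.twitter.com/1.1/search/tweets.json",
        headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
        params={"q": KEYWORDS, "count": 100, "result_type": "recent"},
    )
    tweets = resp.json().get("statuses", [])

    # Assumed keyspace/table: tweets(id text PRIMARY KEY, created_at text, body text, geo text)
    session = Cluster(["cassandra"]).connect("sc7_pilot")
    insert = session.prepare(
        "INSERT INTO tweets (id, created_at, body, geo) VALUES (?, ?, ?, ?)")

    # Store each tweet with its provenance metadata, notably the location if present.
    for t in tweets:
        session.execute(insert, (t["id_str"], t["created_at"], t["text"], str(t.get("geo"))))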

Figure 7: Architecture of the Second SC7 Pilot

8.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures:
- HDFS for storing satellite images.
- Cassandra for storing news and tweets, content and metadata.
- Lucene for storing the GADM dataset, i.e. the administrative areas together with their geo-locations.
- Strabon for storing geo-locations of detected changes and location metadata about news and tweets.

Processing infrastructures:
- Spark will be made available for improving the change detection module and developing the event detection module.

Data integration:
- Semagrow will federate Strabon and Cassandra to provide the user interface with homogeneous access to both data stores.

Other modules:
- Twitter data connector.
- Reuters RSS feed reader.
- The Sentinel Data Aggregator receives as input the set of areas of interest and submits a suitable query to the Sentinels Scientific Data Hub.
- Sextant as the user interface.
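The query step of the Sentinel Data Aggregator can be sketched as follows, using the open-source sentinelsat client as one possible way to talk to the Sentinels Scientific Data Hub. The credentials, area of interest and date window are placeholders; per R5, the polygon would in practice be supplied by the event detection module.

    from sentinelsat import SentinelAPI

    api = SentinelAPI("user", "password", "https://scihub.copernicus.eu/dhus")

    # Area of Interest as WKT; in the pilot this polygon comes from event detection (R5).
    aoi = "POLYGON((23.5 37.8, 23.9 37.8, 23.9 38.1, 23.5 38.1, 23.5 37.8))"

    products = api.query(aoi,
                         date=("20170101", "20170220"),
                         platformname="Sentinel-1",
                         producttype="GRD")

    # The change detection workflow needs the earliest and latest acquisitions of the period.
    ordered = sorted(products.items(), key=lambda kv: kv[1]["beginposition"])
    earliest_id, latest_id = ordered[0][0], ordered[-1][0]
    api.download_all([earliest_id, latest_id])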

8.4 Deployment

Table 14 lists the components provided to the pilot as part of BDI16 and the components that will be developed within WP6 in the context of executing the pilot.

16 Cf. https://github.com/big-data-europe/README/wiki/Components

Table 14: Components needed to deploy the Second SC7 Pilot
- Module: Big Data Integrator (HDFS/Hadoop, Cassandra, Spark, Semagrow, Strabon, SOLR). Task: BDI dockers made available by WP4. Responsible: FH, TF, InfAI, NCSR-D, UoA, SWC.
- Module: Cassandra and Strabon stores. Task: The schema needs to be altered to support tweets by keyword. Responsible: NCSR-D and UoA.
- Module: Change detection module. Task: Spark code to be developed for extending and improving the change detection algorithm. Responsible: UoA.
- Module: Event detection module. Task: Spark code to be developed to scale the event detection algorithm. Responsible: NCSR-D.
- Module: Twitter data connector. Task: To be extended to access the keyword search Twitter API. Responsible: NCSR-D.
- Module: User interface. Task: To be enhanced for the pilot. Responsible: UoA.
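R3 asks for the change detection operators to scale in Apache Spark. The sketch below only illustrates the distribution pattern, parallelising a per-tile operator over pairs of image tiles; the toy pixel-difference function stands in for the SNAP-based operators (e.g. Subset, Terrain Correction) that the pilot actually adapts, and tile data would come from HDFS rather than random arrays.

    import numpy as np
    from pyspark import SparkContext

    def detect_change(pair):
        """Toy per-tile operator: fraction of pixels whose difference exceeds a threshold."""
        tile_id, (before, after) = pair
        diff = np.abs(after.astype(np.float32) - before.astype(np.float32))
        return tile_id, float((diff > 0.2).mean())

    sc = SparkContext(appName="change-detection-scaling-sketch")

    # Stand-in tiles; the pilot would read co-registered Sentinel-1 tiles from HDFS.
    tiles = [(i, (np.random.rand(256, 256), np.random.rand(256, 256))) for i in range(64)]

    results = sc.parallelize(tiles, numSlices=16).map(detect_change).collect()
    for tile_id, changed_fraction in sorted(results):
        print(tile_id, changed_fraction)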


9 Conclusions

This report analysed the pilot requirements and specified the components of the generic Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting round. The relevant work in this task is to ensure that the components are within the scope of what is prepared in WP4, and that they interoperate and can be used in the same application.

All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure and provided to the piloting partners as a basis for their piloting applications, which will be developed in WP6. As a result of this preliminary testing and the interaction between the technical partners and the piloting partners, some of the original pilot descriptions have been refined and fully specified, and their usage of BDI components has been clarified. This ensures that the pilot descriptions are consistent with the first public release of the BDI platform (D4.2) and can be reproduced by interested third parties.

Work in this task (Task 5.2) will proceed as follows:
- During the second pilot deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.
- During the third pilot deployment phase, work in this task will prepare the next version of this document, regarding the BDI instances needed for the third piloting round.

Page 32: Data Technologies · consulting about the platform. This task includes two phases: the design and the deployment phase. The design phase involves the following. Review the pilot descriptions

D54 ndash v 100

Page

32

7 Second SC6 Pilot Deployment

71 Use cases

The pilot is carried out by NCSR-D and SWC in the frame of SC6 Europe in a changing world

- inclusive innovative and reflective societies

The pilot demonstrates the following workflow Municipality economic data (ie budget and

budget execution data) are ingested at a regular basis (daily weekly and so on) from a series

of locations in a variety of structures and formats are homogenized so that they can be

compared analyzed and visualized in a comprehensible way The data is exposed to users

via a dashboard that exposes searchdiscovery aggregation analysis correlation and

visualization functionalities over structured data The results of the data analysis will be stored

in the infrastructure to avoid carrying out the same processing multiple times

The second cycle of the pilot will extend the first pilot by incorporating different formats by

developing a modular parsing library

The following datasets are involved

Budget execution data of Municipality of Athens

Budget execution data of Municipality of Thessaloniki

Budget execution data of Municipality of Barcelona

The current datasets involved are exposed either as an API or as CSV XML files

Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies

Statistical data will be described in the RDF DataCube12 vocabulary

The following processing is carried out

Data ingestion and homogenization

Aggregation analysis correlation over scientific data

The following outputs are made available for visualization or further processing

Structured information extracted from budget datasets exposed as a SPARQL endpoint

Metadata for dataset searching and discovery

9 Cf httpsjoinupeceuropaeuassetdcat_application_profiledescription 10 Cf httpwwwomgorgspecEDMC-FIBOFND10Beta1indexhtm 11 Cf httpwwwomgorghot-topicsfinancehtm 12 Cf httpswwww3orgTR2014REC-vocab-data-cube-20140116

D54 ndash v 100

Page

33

Aggregation and analysis

72 Requirements

Table 11 lists the ingestion storage processing and output requirements set by this pilot

Table 11 Requirements of the Second SC6 Pilot

Requirement Comment

R1 In case of failures during data

processing users should be able to

recover data produced earlier in the

workflow along with their lineage and

associated metadata

Processing modules should periodically

store intermediate results When starting

up processing modules should check at the

metadata registry if intermediate results are

available

R2 Transform budget data into a

homogenized format using various

parsers

Parsers will be developed for the pilot

taking into account R1

R3 Expose data and metadata through a

SPARQL endpoint

The triple store should be accessed via a

SPARQL endpoint

R4 Intuitive easy-to-use interface for

searching and selecting relevant data

sources The use of the user interface

should be documented so that users

can ease into using it with as little

effort as possible

The GraphSearch UI will be used to create

visualizations from SPARQL queries

Table 11 Requirements of the Second SC6 Pilot

D54 ndash v 100

Page

34

Figure 6 Architecture of the Second SC6 Pilot

Figure 6 Architecture of the Second SC6 Pilot

73 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing ingested datasets

4store for storing homogenized statistical data and dataset metadata

Processing infrastructures

Metadata extraction Spark is used to extract RDF data and metadata from budget

data These tools will react on Kafka messages

PoolParty A SKOS Thesaurus13 will be used to consolidatetranslate (linkmap) the

terms in the ingested documents (eg bio terms locations and other named entities)

For this step the SWC PoolParty Semantic Suite14 will be used as an external service

13 Please cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 14 Please cf httpwwwpoolpartybiz

D54 ndash v 100

Page

35

PoolParty is accessible from the BDE components via an HTTP API The connection

between Spark and PoolParty has been implemented in the first pilot cycle Additional

enrichment of the dataset will be explored eg via linking to DBpedia or other LOD

sources

Data analysis that will be performed on demand by pre-defined queries in the

dashboard

Other modules

Flume for dataset ingestion For every source that will be ingested into the system there

will be a flume agent responsible for data ingestion and basic modificationunification

Kafka as soon as a new record is available a Kafka message will be produced One

kafka consumer stores raw data into HDFS

A set of pre-defined SPARQL queries that carry out analytical aggregations important

comparisons and or other analysis of the data

GUI that provide functionality for (a) metadata searching to discover datasets data and

publications (b) linked data browsing (ie dereferencing entity descriptions in RDF) in

the form of a visual dashboard realised in d3js15

GraphSearch as the user interface

74 Deployment

Table 12 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 12 Components needed to deploy the Second SC6 Pilot

Module Task Responsible

Spark over HDFS 4store

Flume Kafka

BDI dockers made available by WP4 FH TF InfAI

NCSR-D SWC

Data storage schema To be extended for the pilot SWC

Metadata extraction Parsers for different data sources will

be developed for the pilot

SWC

15 Cf httpsd3jsorg

D54 ndash v 100

Page

36

GraphSearch GUI To be configured for the pilot SWC

Table 12 Components needed to deploy the Second SC6 Pilot

8 Second SC7 Pilot Deployment

81 Use cases

The pilot is carried out by SatCen UoA and NCSR-D in the frame of SC7 Secure societies ndash

Protecting freedom and security of Europe and its citizens

The pilot demonstrates the following workflows

1 Event detection workflow News sites and social media are monitored and processed

in order to extract and localize information about events Events are categorized and

the information from them is extracted the end-user is notified about the area interested

by the news and can visualize the events information together with the changes

detected by the other workflow (if activated)

2 Change detection workflow The end user selects a relevant Area of Interest With

respect to the selected dates two satellite images (earliest and latest) of these areas

are downloaded from ESA Sentinels Scientific Data Hub and processed in order to

detect changes The end-user is notified about detected changes and can view the

images and event information about this area

The second cycle of the SC7 pilot will extend the functionality and improve the performance of

the first cycle of the pilot (cf D52 Section 8)

Apart from the datasets used in the first cycle of the pilot this cycle will also use the keyword-

based Twitter API to retrieve tweets based on pre-defined keywords To further support the

keyword-based search the second cycle of the pilot will also include a full-text indexing engine

The following outputs are made available for visualization or further processing

Relevant news related to specific keywords together with the corresponding Area of

Interested

Detected changes

Moreover the event detection workflow will be extended in order to automatically activate the

change detection workflow These changes are depicted in the updated architecture diagram

in Figure 7

D54 ndash v 100

Page

37

82 Requirements

Table 13 lists the ingestion storage processing and output requirements set by the second

cycle of the pilot Since the present pilot cycle is an extension of the first pilot the requirements

of the first pilot also apply Table 13 lists only the new requirements

Table 13 Requirements of the Second SC7 Pilot

Requirement Comment

R1 Monitor keyword-based text services

(Twitter) Text is retrieved and stored

together with provenance and any

metadata provided by the service

(notably location)

The NOMAD data connectors to Twitter

and Reuters will be adapted to access the

keyword search API of Twitter and store to

Cassandra

R2 Regularly execute event detection

using Spark over the most recent text

batch

Event detection is part of the ingestion

process and adds annotations to the text

data not part of the distributed processing

R3 Improve the speed of the change

detection workflow

Optimize the scalability of the operators

developed in Apache Spark for the change

detection workflow

R4 Extend change detection workflow to

improve accuracy

Fundamental SNAP operators (eg Subset

and Terrain Correction) for Sentinel 1 will be

adapted to Apache Spark

R5 Areas of Interest are automatically

defined by event detection

The Sentinel data connector is

parametrized from the event detection

module with a GIS shape

R6 End-user interface is based on Sextant Improvement of Sextant functionalities to

improve the user experience

D54 ndash v 100

Page

38

R7 Users must be authenticated and

authorized to access the pilot data

Sextant will be extended in order to support

authentication and authorization

Table 13 Requirements of the Second SC7 Pilot

Figure 7 Architecture of the Second SC7 Pilot

Figure 7 Architecture of the Second SC7 Pilot

83 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing satellite images

Cassandra for storing news and tweets content and metadata

Lucene for storing GADM dataset ie the administrative areas together with their geo-

locations

D54 ndash v 100

Page

39

Strabon for storing geo-locations of detected changes and location metadata about

news and tweets

Processing infrastructures

Spark will be made available for improving the change detection module and

developing the event detection module

Data integration

Semagrow will federate Strabon and Cassandra to provide the user interface with

homogeneous access to both data stores

Other modules

Twitter data connector

Reuters RSS feed reader

The Sentinel Data Aggregator receives as input the set of areas of interest and submits

a suitable query to the Sentinels Scientific Data Hub

Sextant as the user interface

84 Deployment

Table 14 lists the components provided to the pilot as part of BDI16 and components that will

be developed within WP6 in the context of executing the pilot

Table 14 Components needed to deploy the Second SC7 Pilot

Module Task Responsible

Big Data Integrator

HDFSHadoop Cassandra

Spark Semagrow Strabon

SOLR

BDI dockers made available by WP4 FH TF InfAI

NCSR-D UoA

SwC

Cassandra and Strabon

stores

The schema needs to be altered to

support tweets by keyword

NCSR-D and

UoA

Change detection module Spark code to be developed for UoA

16 Cf httpsgithubcombig-data-europeREADMEwikiComponents

D54 ndash v 100

Page

40

extending and improving the change

detection algorithm

Event Detection module Spark code to be developed to scale

the event detection algorithm

NCSR-D

Twitter data connector To be extended to access the keyword

search Twitter API

NCSR-D

User interface To be enhanced for the pilot UoA

Table 14 Components needed to deploy the Second SC7 Pilot

D54 ndash v 100

Page

41

9 Conclusions This report analysed the pilot requirements and specifies the components of the the generic

Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting

round The relevant work in this task is to ensure that the components are within the scope

of what is prepared in WP4 and that they interoperate and can be used in the same

application

All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure

and provided to the piloting partners as a basis for their piloting applications which will be

developed in WP6 As a result of this preliminary testing and the interaction between the

technical partners and the piloting partners some of the original pilot descriptions have

been refined and fully specified and their usage of BDI components has been clarified This

ensures that the pilot descriptions are consistent with the first public release of the BDI

platform (D42) and can be reproduced by interested third parties

Work in this task (Task 52) will proceed as follows

During the second pilot deployment phase work in this task will follow and document

development of the individual components and test their integration into the platform

During the third pilot deployment phase work in this task will prepare the next version

of this document regarding the BDI instances needed for the third piloting round

Page 33: Data Technologies · consulting about the platform. This task includes two phases: the design and the deployment phase. The design phase involves the following. Review the pilot descriptions

D54 ndash v 100

Page

33

Aggregation and analysis

72 Requirements

Table 11 lists the ingestion storage processing and output requirements set by this pilot

Table 11 Requirements of the Second SC6 Pilot

Requirement Comment

R1 In case of failures during data

processing users should be able to

recover data produced earlier in the

workflow along with their lineage and

associated metadata

Processing modules should periodically

store intermediate results When starting

up processing modules should check at the

metadata registry if intermediate results are

available

R2 Transform budget data into a

homogenized format using various

parsers

Parsers will be developed for the pilot

taking into account R1

R3 Expose data and metadata through a

SPARQL endpoint

The triple store should be accessed via a

SPARQL endpoint

R4 Intuitive easy-to-use interface for

searching and selecting relevant data

sources The use of the user interface

should be documented so that users

can ease into using it with as little

effort as possible

The GraphSearch UI will be used to create

visualizations from SPARQL queries

Table 11 Requirements of the Second SC6 Pilot

D54 ndash v 100

Page

34

Figure 6 Architecture of the Second SC6 Pilot

Figure 6 Architecture of the Second SC6 Pilot

73 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing ingested datasets

4store for storing homogenized statistical data and dataset metadata

Processing infrastructures

Metadata extraction Spark is used to extract RDF data and metadata from budget

data These tools will react on Kafka messages

PoolParty A SKOS Thesaurus13 will be used to consolidatetranslate (linkmap) the

terms in the ingested documents (eg bio terms locations and other named entities)

For this step the SWC PoolParty Semantic Suite14 will be used as an external service

13 Please cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 14 Please cf httpwwwpoolpartybiz

D54 ndash v 100

Page

35

PoolParty is accessible from the BDE components via an HTTP API The connection

between Spark and PoolParty has been implemented in the first pilot cycle Additional

enrichment of the dataset will be explored eg via linking to DBpedia or other LOD

sources

Data analysis that will be performed on demand by pre-defined queries in the

dashboard

Other modules

Flume for dataset ingestion For every source that will be ingested into the system there

will be a flume agent responsible for data ingestion and basic modificationunification

Kafka as soon as a new record is available a Kafka message will be produced One

kafka consumer stores raw data into HDFS

A set of pre-defined SPARQL queries that carry out analytical aggregations important

comparisons and or other analysis of the data

GUI that provide functionality for (a) metadata searching to discover datasets data and

publications (b) linked data browsing (ie dereferencing entity descriptions in RDF) in

the form of a visual dashboard realised in d3js15

GraphSearch as the user interface

74 Deployment

Table 12 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 12 Components needed to deploy the Second SC6 Pilot

Module Task Responsible

Spark over HDFS 4store

Flume Kafka

BDI dockers made available by WP4 FH TF InfAI

NCSR-D SWC

Data storage schema To be extended for the pilot SWC

Metadata extraction Parsers for different data sources will

be developed for the pilot

SWC

15 Cf httpsd3jsorg

D54 ndash v 100

Page

36

GraphSearch GUI To be configured for the pilot SWC

Table 12 Components needed to deploy the Second SC6 Pilot

8 Second SC7 Pilot Deployment

81 Use cases

The pilot is carried out by SatCen UoA and NCSR-D in the frame of SC7 Secure societies ndash

Protecting freedom and security of Europe and its citizens

The pilot demonstrates the following workflows

1 Event detection workflow News sites and social media are monitored and processed

in order to extract and localize information about events Events are categorized and

the information from them is extracted the end-user is notified about the area interested

by the news and can visualize the events information together with the changes

detected by the other workflow (if activated)

2 Change detection workflow The end user selects a relevant Area of Interest With

respect to the selected dates two satellite images (earliest and latest) of these areas

are downloaded from ESA Sentinels Scientific Data Hub and processed in order to

detect changes The end-user is notified about detected changes and can view the

images and event information about this area

The second cycle of the SC7 pilot will extend the functionality and improve the performance of

the first cycle of the pilot (cf D52 Section 8)

Apart from the datasets used in the first cycle of the pilot this cycle will also use the keyword-

based Twitter API to retrieve tweets based on pre-defined keywords To further support the

keyword-based search the second cycle of the pilot will also include a full-text indexing engine

The following outputs are made available for visualization or further processing

Relevant news related to specific keywords together with the corresponding Area of

Interested

Detected changes

Moreover the event detection workflow will be extended in order to automatically activate the

change detection workflow These changes are depicted in the updated architecture diagram

in Figure 7

D54 ndash v 100

Page

37

82 Requirements

Table 13 lists the ingestion storage processing and output requirements set by the second

cycle of the pilot Since the present pilot cycle is an extension of the first pilot the requirements

of the first pilot also apply Table 13 lists only the new requirements

Table 13 Requirements of the Second SC7 Pilot

Requirement Comment

R1 Monitor keyword-based text services

(Twitter) Text is retrieved and stored

together with provenance and any

metadata provided by the service

(notably location)

The NOMAD data connectors to Twitter

and Reuters will be adapted to access the

keyword search API of Twitter and store to

Cassandra

R2 Regularly execute event detection

using Spark over the most recent text

batch

Event detection is part of the ingestion

process and adds annotations to the text

data not part of the distributed processing

R3 Improve the speed of the change

detection workflow

Optimize the scalability of the operators

developed in Apache Spark for the change

detection workflow

R4 Extend change detection workflow to

improve accuracy

Fundamental SNAP operators (eg Subset

and Terrain Correction) for Sentinel 1 will be

adapted to Apache Spark

R5 Areas of Interest are automatically

defined by event detection

The Sentinel data connector is

parametrized from the event detection

module with a GIS shape

R6 End-user interface is based on Sextant Improvement of Sextant functionalities to

improve the user experience

D54 ndash v 100

Page

38

R7 Users must be authenticated and

authorized to access the pilot data

Sextant will be extended in order to support

authentication and authorization

Table 13 Requirements of the Second SC7 Pilot

Figure 7 Architecture of the Second SC7 Pilot

Figure 7 Architecture of the Second SC7 Pilot

83 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing satellite images

Cassandra for storing news and tweets content and metadata

Lucene for storing GADM dataset ie the administrative areas together with their geo-

locations

D54 ndash v 100

Page

39

Strabon for storing geo-locations of detected changes and location metadata about

news and tweets

Processing infrastructures

Spark will be made available for improving the change detection module and

developing the event detection module

Data integration

Semagrow will federate Strabon and Cassandra to provide the user interface with

homogeneous access to both data stores

Other modules

Twitter data connector

Reuters RSS feed reader

The Sentinel Data Aggregator receives as input the set of areas of interest and submits

a suitable query to the Sentinels Scientific Data Hub

Sextant as the user interface

84 Deployment

Table 14 lists the components provided to the pilot as part of BDI16 and components that will

be developed within WP6 in the context of executing the pilot

Table 14 Components needed to deploy the Second SC7 Pilot

Module Task Responsible

Big Data Integrator

HDFSHadoop Cassandra

Spark Semagrow Strabon

SOLR

BDI dockers made available by WP4 FH TF InfAI

NCSR-D UoA

SwC

Cassandra and Strabon

stores

The schema needs to be altered to

support tweets by keyword

NCSR-D and

UoA

Change detection module Spark code to be developed for UoA

16 Cf httpsgithubcombig-data-europeREADMEwikiComponents

D54 ndash v 100

Page

40

extending and improving the change

detection algorithm

Event Detection module Spark code to be developed to scale

the event detection algorithm

NCSR-D

Twitter data connector To be extended to access the keyword

search Twitter API

NCSR-D

User interface To be enhanced for the pilot UoA

Table 14 Components needed to deploy the Second SC7 Pilot

D54 ndash v 100

Page

41

9 Conclusions This report analysed the pilot requirements and specifies the components of the the generic

Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting

round The relevant work in this task is to ensure that the components are within the scope

of what is prepared in WP4 and that they interoperate and can be used in the same

application

All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure

and provided to the piloting partners as a basis for their piloting applications which will be

developed in WP6 As a result of this preliminary testing and the interaction between the

technical partners and the piloting partners some of the original pilot descriptions have

been refined and fully specified and their usage of BDI components has been clarified This

ensures that the pilot descriptions are consistent with the first public release of the BDI

platform (D42) and can be reproduced by interested third parties

Work in this task (Task 52) will proceed as follows

During the second pilot deployment phase work in this task will follow and document

development of the individual components and test their integration into the platform

During the third pilot deployment phase work in this task will prepare the next version

of this document regarding the BDI instances needed for the third piloting round

Page 34: Data Technologies · consulting about the platform. This task includes two phases: the design and the deployment phase. The design phase involves the following. Review the pilot descriptions

D54 ndash v 100

Page

34

Figure 6 Architecture of the Second SC6 Pilot

Figure 6 Architecture of the Second SC6 Pilot

73 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing ingested datasets

4store for storing homogenized statistical data and dataset metadata

Processing infrastructures

Metadata extraction Spark is used to extract RDF data and metadata from budget

data These tools will react on Kafka messages

PoolParty A SKOS Thesaurus13 will be used to consolidatetranslate (linkmap) the

terms in the ingested documents (eg bio terms locations and other named entities)

For this step the SWC PoolParty Semantic Suite14 will be used as an external service

13 Please cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 14 Please cf httpwwwpoolpartybiz

D54 ndash v 100

Page

35

PoolParty is accessible from the BDE components via an HTTP API The connection

between Spark and PoolParty has been implemented in the first pilot cycle Additional

enrichment of the dataset will be explored eg via linking to DBpedia or other LOD

sources

Data analysis that will be performed on demand by pre-defined queries in the

dashboard

Other modules

Flume for dataset ingestion For every source that will be ingested into the system there

will be a flume agent responsible for data ingestion and basic modificationunification

Kafka as soon as a new record is available a Kafka message will be produced One

kafka consumer stores raw data into HDFS

A set of pre-defined SPARQL queries that carry out analytical aggregations important

comparisons and or other analysis of the data

GUI that provide functionality for (a) metadata searching to discover datasets data and

publications (b) linked data browsing (ie dereferencing entity descriptions in RDF) in

the form of a visual dashboard realised in d3js15

GraphSearch as the user interface

74 Deployment

Table 12 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 12 Components needed to deploy the Second SC6 Pilot

Module Task Responsible

Spark over HDFS 4store

Flume Kafka

BDI dockers made available by WP4 FH TF InfAI

NCSR-D SWC

Data storage schema To be extended for the pilot SWC

Metadata extraction Parsers for different data sources will

be developed for the pilot

SWC

15 Cf httpsd3jsorg

D54 ndash v 100

Page

36

GraphSearch GUI To be configured for the pilot SWC

Table 12 Components needed to deploy the Second SC6 Pilot

8 Second SC7 Pilot Deployment

81 Use cases

The pilot is carried out by SatCen UoA and NCSR-D in the frame of SC7 Secure societies ndash

Protecting freedom and security of Europe and its citizens

The pilot demonstrates the following workflows

1 Event detection workflow News sites and social media are monitored and processed

in order to extract and localize information about events Events are categorized and

the information from them is extracted the end-user is notified about the area interested

by the news and can visualize the events information together with the changes

detected by the other workflow (if activated)

2 Change detection workflow The end user selects a relevant Area of Interest With

respect to the selected dates two satellite images (earliest and latest) of these areas

are downloaded from ESA Sentinels Scientific Data Hub and processed in order to

detect changes The end-user is notified about detected changes and can view the

images and event information about this area

The second cycle of the SC7 pilot will extend the functionality and improve the performance of

the first cycle of the pilot (cf D52 Section 8)

Apart from the datasets used in the first cycle of the pilot this cycle will also use the keyword-

based Twitter API to retrieve tweets based on pre-defined keywords To further support the

keyword-based search the second cycle of the pilot will also include a full-text indexing engine

The following outputs are made available for visualization or further processing

Relevant news related to specific keywords together with the corresponding Area of

Interested

Detected changes

Moreover the event detection workflow will be extended in order to automatically activate the

change detection workflow These changes are depicted in the updated architecture diagram

in Figure 7

D54 ndash v 100

Page

37

82 Requirements

Table 13 lists the ingestion storage processing and output requirements set by the second

cycle of the pilot Since the present pilot cycle is an extension of the first pilot the requirements

of the first pilot also apply Table 13 lists only the new requirements

Table 13 Requirements of the Second SC7 Pilot

Requirement Comment

R1 Monitor keyword-based text services

(Twitter) Text is retrieved and stored

together with provenance and any

metadata provided by the service

(notably location)

The NOMAD data connectors to Twitter

and Reuters will be adapted to access the

keyword search API of Twitter and store to

Cassandra

R2 Regularly execute event detection

using Spark over the most recent text

batch

Event detection is part of the ingestion

process and adds annotations to the text

data not part of the distributed processing

R3 Improve the speed of the change

detection workflow

Optimize the scalability of the operators

developed in Apache Spark for the change

detection workflow

R4 Extend change detection workflow to

improve accuracy

Fundamental SNAP operators (eg Subset

and Terrain Correction) for Sentinel 1 will be

adapted to Apache Spark

R5 Areas of Interest are automatically

defined by event detection

The Sentinel data connector is

parametrized from the event detection

module with a GIS shape

R6 End-user interface is based on Sextant Improvement of Sextant functionalities to

improve the user experience

D54 ndash v 100

Page

38

R7 Users must be authenticated and

authorized to access the pilot data

Sextant will be extended in order to support

authentication and authorization

Table 13 Requirements of the Second SC7 Pilot

Figure 7 Architecture of the Second SC7 Pilot

Figure 7 Architecture of the Second SC7 Pilot

83 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing satellite images

Cassandra for storing news and tweets content and metadata

Lucene for storing GADM dataset ie the administrative areas together with their geo-

locations

D54 ndash v 100

Page

39

Strabon for storing geo-locations of detected changes and location metadata about

news and tweets

Processing infrastructures

Spark will be made available for improving the change detection module and

developing the event detection module

Data integration

Semagrow will federate Strabon and Cassandra to provide the user interface with

homogeneous access to both data stores

Other modules

Twitter data connector

Reuters RSS feed reader

The Sentinel Data Aggregator receives as input the set of areas of interest and submits

a suitable query to the Sentinels Scientific Data Hub

Sextant as the user interface

84 Deployment

Table 14 lists the components provided to the pilot as part of BDI16 and components that will

be developed within WP6 in the context of executing the pilot

Table 14 Components needed to deploy the Second SC7 Pilot

Module Task Responsible

Big Data Integrator

HDFSHadoop Cassandra

Spark Semagrow Strabon

SOLR

BDI dockers made available by WP4 FH TF InfAI

NCSR-D UoA

SwC

Cassandra and Strabon

stores

The schema needs to be altered to

support tweets by keyword

NCSR-D and

UoA

Change detection module Spark code to be developed for UoA

16 Cf httpsgithubcombig-data-europeREADMEwikiComponents

D54 ndash v 100

Page

40

extending and improving the change

detection algorithm

Event Detection module Spark code to be developed to scale

the event detection algorithm

NCSR-D

Twitter data connector To be extended to access the keyword

search Twitter API

NCSR-D

User interface To be enhanced for the pilot UoA

Table 14 Components needed to deploy the Second SC7 Pilot

D54 ndash v 100

Page

41

9 Conclusions This report analysed the pilot requirements and specifies the components of the the generic

Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting

round The relevant work in this task is to ensure that the components are within the scope

of what is prepared in WP4 and that they interoperate and can be used in the same

application

All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure

and provided to the piloting partners as a basis for their piloting applications which will be

developed in WP6 As a result of this preliminary testing and the interaction between the

technical partners and the piloting partners some of the original pilot descriptions have

been refined and fully specified and their usage of BDI components has been clarified This

ensures that the pilot descriptions are consistent with the first public release of the BDI

platform (D42) and can be reproduced by interested third parties

Work in this task (Task 52) will proceed as follows

During the second pilot deployment phase work in this task will follow and document

development of the individual components and test their integration into the platform

During the third pilot deployment phase work in this task will prepare the next version

of this document regarding the BDI instances needed for the third piloting round

Page 35: Data Technologies · consulting about the platform. This task includes two phases: the design and the deployment phase. The design phase involves the following. Review the pilot descriptions

D54 ndash v 100

Page

35

PoolParty is accessible from the BDE components via an HTTP API The connection

between Spark and PoolParty has been implemented in the first pilot cycle Additional

enrichment of the dataset will be explored eg via linking to DBpedia or other LOD

sources

Data analysis that will be performed on demand by pre-defined queries in the

dashboard

Other modules

Flume for dataset ingestion For every source that will be ingested into the system there

will be a flume agent responsible for data ingestion and basic modificationunification

Kafka as soon as a new record is available a Kafka message will be produced One

kafka consumer stores raw data into HDFS

A set of pre-defined SPARQL queries that carry out analytical aggregations important

comparisons and or other analysis of the data

GUI that provide functionality for (a) metadata searching to discover datasets data and

publications (b) linked data browsing (ie dereferencing entity descriptions in RDF) in

the form of a visual dashboard realised in d3js15

GraphSearch as the user interface

74 Deployment

Table 12 lists the components provided to the pilot as part of BDI and components that will be

developed within WP6 in the context of executing the pilot

Table 12 Components needed to deploy the Second SC6 Pilot

Module Task Responsible

Spark over HDFS 4store

Flume Kafka

BDI dockers made available by WP4 FH TF InfAI

NCSR-D SWC

Data storage schema To be extended for the pilot SWC

Metadata extraction Parsers for different data sources will

be developed for the pilot

SWC

15 Cf httpsd3jsorg

D54 ndash v 100

Page

36

GraphSearch GUI To be configured for the pilot SWC

Table 12 Components needed to deploy the Second SC6 Pilot

8 Second SC7 Pilot Deployment

81 Use cases

The pilot is carried out by SatCen UoA and NCSR-D in the frame of SC7 Secure societies ndash

Protecting freedom and security of Europe and its citizens

The pilot demonstrates the following workflows

1 Event detection workflow News sites and social media are monitored and processed

in order to extract and localize information about events Events are categorized and

the information from them is extracted the end-user is notified about the area interested

by the news and can visualize the events information together with the changes

detected by the other workflow (if activated)

2 Change detection workflow The end user selects a relevant Area of Interest With

respect to the selected dates two satellite images (earliest and latest) of these areas

are downloaded from ESA Sentinels Scientific Data Hub and processed in order to

detect changes The end-user is notified about detected changes and can view the

images and event information about this area

The second cycle of the SC7 pilot will extend the functionality and improve the performance of

the first cycle of the pilot (cf D52 Section 8)

Apart from the datasets used in the first cycle of the pilot this cycle will also use the keyword-

based Twitter API to retrieve tweets based on pre-defined keywords To further support the

keyword-based search the second cycle of the pilot will also include a full-text indexing engine

The following outputs are made available for visualization or further processing

Relevant news related to specific keywords together with the corresponding Area of

Interested

Detected changes

Moreover the event detection workflow will be extended in order to automatically activate the

change detection workflow These changes are depicted in the updated architecture diagram

in Figure 7

D54 ndash v 100

Page

37

82 Requirements

Table 13 lists the ingestion storage processing and output requirements set by the second

cycle of the pilot Since the present pilot cycle is an extension of the first pilot the requirements

of the first pilot also apply Table 13 lists only the new requirements

Table 13 Requirements of the Second SC7 Pilot

Requirement Comment

R1 Monitor keyword-based text services

(Twitter) Text is retrieved and stored

together with provenance and any

metadata provided by the service

(notably location)

The NOMAD data connectors to Twitter

and Reuters will be adapted to access the

keyword search API of Twitter and store to

Cassandra

R2 Regularly execute event detection

using Spark over the most recent text

batch

Event detection is part of the ingestion

process and adds annotations to the text

data not part of the distributed processing

R3 Improve the speed of the change

detection workflow

Optimize the scalability of the operators

developed in Apache Spark for the change

detection workflow

R4 Extend change detection workflow to

improve accuracy

Fundamental SNAP operators (eg Subset

and Terrain Correction) for Sentinel 1 will be

adapted to Apache Spark

R5 Areas of Interest are automatically

defined by event detection

The Sentinel data connector is

parametrized from the event detection

module with a GIS shape

R6 End-user interface is based on Sextant Improvement of Sextant functionalities to

improve the user experience

D54 ndash v 100

Page

38

R7 Users must be authenticated and

authorized to access the pilot data

Sextant will be extended in order to support

authentication and authorization

Table 13 Requirements of the Second SC7 Pilot

Figure 7 Architecture of the Second SC7 Pilot

Figure 7 Architecture of the Second SC7 Pilot

83 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing satellite images

Cassandra for storing news and tweets content and metadata

Lucene for storing GADM dataset ie the administrative areas together with their geo-

locations

D54 ndash v 100

Page

39

Strabon for storing geo-locations of detected changes and location metadata about

news and tweets

Processing infrastructures

Spark will be made available for improving the change detection module and

developing the event detection module

Data integration

Semagrow will federate Strabon and Cassandra to provide the user interface with

homogeneous access to both data stores

Other modules

Twitter data connector

Reuters RSS feed reader

The Sentinel Data Aggregator receives as input the set of areas of interest and submits

a suitable query to the Sentinels Scientific Data Hub

Sextant as the user interface

84 Deployment

Table 14 lists the components provided to the pilot as part of BDI16 and components that will

be developed within WP6 in the context of executing the pilot

Table 14 Components needed to deploy the Second SC7 Pilot

Module Task Responsible

Big Data Integrator

HDFSHadoop Cassandra

Spark Semagrow Strabon

SOLR

BDI dockers made available by WP4 FH TF InfAI

NCSR-D UoA

SwC

Cassandra and Strabon

stores

The schema needs to be altered to

support tweets by keyword

NCSR-D and

UoA

Change detection module Spark code to be developed for UoA

16 Cf httpsgithubcombig-data-europeREADMEwikiComponents

D54 ndash v 100

Page

40

extending and improving the change

detection algorithm

Event Detection module Spark code to be developed to scale

the event detection algorithm

NCSR-D

Twitter data connector To be extended to access the keyword

search Twitter API

NCSR-D

User interface To be enhanced for the pilot UoA

Table 14 Components needed to deploy the Second SC7 Pilot

D54 ndash v 100

Page

41

9 Conclusions This report analysed the pilot requirements and specifies the components of the the generic

Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting

round The relevant work in this task is to ensure that the components are within the scope

of what is prepared in WP4 and that they interoperate and can be used in the same

application

All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure

and provided to the piloting partners as a basis for their piloting applications which will be

developed in WP6 As a result of this preliminary testing and the interaction between the

technical partners and the piloting partners some of the original pilot descriptions have

been refined and fully specified and their usage of BDI components has been clarified This

ensures that the pilot descriptions are consistent with the first public release of the BDI

platform (D42) and can be reproduced by interested third parties

Work in this task (Task 52) will proceed as follows

During the second pilot deployment phase work in this task will follow and document

development of the individual components and test their integration into the platform

During the third pilot deployment phase work in this task will prepare the next version

of this document regarding the BDI instances needed for the third piloting round

Page 36: Data Technologies · consulting about the platform. This task includes two phases: the design and the deployment phase. The design phase involves the following. Review the pilot descriptions

D54 ndash v 100

Page

36

GraphSearch GUI To be configured for the pilot SWC

Table 12 Components needed to deploy the Second SC6 Pilot

8 Second SC7 Pilot Deployment

81 Use cases

The pilot is carried out by SatCen UoA and NCSR-D in the frame of SC7 Secure societies ndash

Protecting freedom and security of Europe and its citizens

The pilot demonstrates the following workflows

1 Event detection workflow News sites and social media are monitored and processed

in order to extract and localize information about events Events are categorized and

the information from them is extracted the end-user is notified about the area interested

by the news and can visualize the events information together with the changes

detected by the other workflow (if activated)

2 Change detection workflow The end user selects a relevant Area of Interest With

respect to the selected dates two satellite images (earliest and latest) of these areas

are downloaded from ESA Sentinels Scientific Data Hub and processed in order to

detect changes The end-user is notified about detected changes and can view the

images and event information about this area

The second cycle of the SC7 pilot will extend the functionality and improve the performance of

the first cycle of the pilot (cf D52 Section 8)

Apart from the datasets used in the first cycle of the pilot this cycle will also use the keyword-

based Twitter API to retrieve tweets based on pre-defined keywords To further support the

keyword-based search the second cycle of the pilot will also include a full-text indexing engine

The following outputs are made available for visualization or further processing

Relevant news related to specific keywords together with the corresponding Area of

Interested

Detected changes

Moreover the event detection workflow will be extended in order to automatically activate the

change detection workflow These changes are depicted in the updated architecture diagram

in Figure 7

D54 ndash v 100

Page

37

82 Requirements

Table 13 lists the ingestion storage processing and output requirements set by the second

cycle of the pilot Since the present pilot cycle is an extension of the first pilot the requirements

of the first pilot also apply Table 13 lists only the new requirements

Table 13 Requirements of the Second SC7 Pilot

Requirement Comment

R1 Monitor keyword-based text services

(Twitter) Text is retrieved and stored

together with provenance and any

metadata provided by the service

(notably location)

The NOMAD data connectors to Twitter

and Reuters will be adapted to access the

keyword search API of Twitter and store to

Cassandra

R2 Regularly execute event detection

using Spark over the most recent text

batch

Event detection is part of the ingestion

process and adds annotations to the text

data not part of the distributed processing

R3 Improve the speed of the change

detection workflow

Optimize the scalability of the operators

developed in Apache Spark for the change

detection workflow

R4 Extend change detection workflow to

improve accuracy

Fundamental SNAP operators (eg Subset

and Terrain Correction) for Sentinel 1 will be

adapted to Apache Spark

R5 Areas of Interest are automatically

defined by event detection

The Sentinel data connector is

parametrized from the event detection

module with a GIS shape

R6 End-user interface is based on Sextant Improvement of Sextant functionalities to

improve the user experience

D54 ndash v 100

Page

38

R7 Users must be authenticated and

authorized to access the pilot data

Sextant will be extended in order to support

authentication and authorization

Table 13 Requirements of the Second SC7 Pilot

Figure 7 Architecture of the Second SC7 Pilot

Figure 7 Architecture of the Second SC7 Pilot

83 Architecture

To satisfy the requirements above the following modules will be deployed

Storage infrastructures

HDFS for storing satellite images

Cassandra for storing news and tweets content and metadata

Lucene for storing GADM dataset ie the administrative areas together with their geo-

locations

D54 ndash v 100

Page

39

Strabon for storing geo-locations of detected changes and location metadata about

news and tweets

Processing infrastructures

Spark will be made available for improving the change detection module and

developing the event detection module

Data integration

Semagrow will federate Strabon and Cassandra to provide the user interface with

homogeneous access to both data stores

Other modules

Twitter data connector

Reuters RSS feed reader

The Sentinel Data Aggregator receives as input the set of areas of interest and submits

a suitable query to the Sentinels Scientific Data Hub

Sextant as the user interface

84 Deployment

Table 14 lists the components provided to the pilot as part of BDI16 and components that will

be developed within WP6 in the context of executing the pilot

Table 14 Components needed to deploy the Second SC7 Pilot

Module Task Responsible

Big Data Integrator

HDFSHadoop Cassandra

Spark Semagrow Strabon

SOLR

BDI dockers made available by WP4 FH TF InfAI

NCSR-D UoA

SwC

Cassandra and Strabon

stores

The schema needs to be altered to

support tweets by keyword

NCSR-D and

UoA

Change detection module Spark code to be developed for UoA

16 Cf httpsgithubcombig-data-europeREADMEwikiComponents

D54 ndash v 100

Page

40

extending and improving the change

detection algorithm

Event Detection module Spark code to be developed to scale

the event detection algorithm

NCSR-D

Twitter data connector To be extended to access the keyword

search Twitter API

NCSR-D

User interface To be enhanced for the pilot UoA

Table 14 Components needed to deploy the Second SC7 Pilot

D54 ndash v 100

Page

41

9 Conclusions This report analysed the pilot requirements and specifies the components of the the generic

Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting

round The relevant work in this task is to ensure that the components are within the scope

of what is prepared in WP4 and that they interoperate and can be used in the same

application

All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure

and provided to the piloting partners as a basis for their piloting applications which will be

developed in WP6 As a result of this preliminary testing and the interaction between the

technical partners and the piloting partners some of the original pilot descriptions have

been refined and fully specified and their usage of BDI components has been clarified This

ensures that the pilot descriptions are consistent with the first public release of the BDI

platform (D42) and can be reproduced by interested third parties

Work in this task (Task 5.2) will proceed as follows:
- During the second pilot deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.
- During the third pilot deployment phase, work in this task will prepare the next version of this document, covering the BDI instances needed for the third piloting round.
