Project funded by the European Union's Horizon 2020 Research and Innovation Programme (2014–2020)
Support Action
Big Data Europe – Empowering Communities with Data Technologies
Project Number: 644564    Start Date of Project: 01/01/2015    Duration: 36 months
Deliverable 5.4
Domain-Specific Big Data Integrator Instances II
Dissemination Level: Public
Due Date of Deliverable: Month 24, 01/01/2017
Actual Submission Date: 24/02/2017
Work Package: WP5, Big Data Integrator Instances
Task: T5.2
Type: Other
Approval Status: Approved
Version: 1.0.0
Number of Pages: 41
Filename: D5.4_Domain-Specific Big Data Integrator Instances II
Abstract: Documentation of the Big Data Integrator Instances deployed for executing the pilots of the Big Data Integrator across all seven societal challenges.
The information in this document reflects only the author's views and the European Community is not liable for any use that may be made of the information contained therein. The information in this document is provided "as is" without guarantee or warranty of any kind, express or implied, including but not limited to the fitness of the information for a particular purpose. The user thereof uses the information at his/her sole risk and liability.
Ref. Ares(2017)1004145 - 24/02/2017
History
Version | Date | Reason | Revised by
0.0.0 | 21/01/2016 | Document structure | S. Konstantopoulos
0.0.1 | 02/12/2016 | First draft based on descriptions of second piloting cycle | A. Charalambidis, S. Konstantopoulos, G. Stavrinos and I. Klampanos
0.0.2 | 12/01/2017 | Second draft based on the feedback of the piloting partners | A. Charalambidis, S. Konstantopoulos and V. Karkaletsis
0.0.3 | 12/01/2017 | Corrections and comments for SC2 pilot | P. Karampiperis and P. Zervas
0.0.4 | 13/01/2017 | Corrections and comments for SC7 pilot | G. Papadakis, M. Lazarrini
0.0.5 | 22/01/2017 | Corrections for SC1 pilot | B. Williams-Jones, K. McNeice
0.0.6 | 25/01/2017 | SC3 pilot architecture | A. Charalambidis and I. Mouchakis
0.0.7 | 01/02/2017 | Corrections and clarifications for SC3 | A. Charalambidis and F. Mouzakis
0.0.8 | 09/02/2017 | Peer review | A. Versteden and H. Jabeen
0.0.9 | 16/02/2017 | Address peer review comments | A. Charalambidis, S. Konstantopoulos, I. Klampanos, J. Jakobitsch and K. McNeice
1.0.0 | 23/02/2017 | Final version |
Author List
Organisation | Name | Contact Information
NCSR-D | Ioannis Mouchakis | gmouchakis@iit.demokritos.gr
NCSR-D | Stasinos Konstantopoulos | konstant@iit.demokritos.gr
NCSR-D | Angelos Charalambidis | acharal@iit.demokritos.gr
NCSR-D | Georgios Stavrinos | gstavrinos@iit.demokritos.gr
NCSR-D | Vangelis Karkaletsis | vangelis@iit.demokritos.gr
Agroknow | Pythagoras Karampiperis | pythk@agroknow.com
Agroknow | Panagiotis Zervas | pzervas@agroknow.com
Open PHACTS | Bryn Williams-Jones | bryn@openphactsfoundation.org
Open PHACTS | Kiera McNeice | kiera@openphactsfoundation.org
CRES | Fragkiskos Mouzakis | mouzakis@cres.gr
TenForce | Aad Versteden | aad.versteden@tenforce.com
UoB | Hajira Jabeen | hajirajabeen@gmail.com
SWC | Jürgen Jakobitsch | jjakobitsch@semantic-web.at
Executive Summary
This report documents the instantiations of the Big Data Integrator Platform that underlie the pilot applications that will be prepared in WP6 for serving exemplary use cases of the Horizon 2020 Societal Challenges. These platform instances will be provided to the relevant networking partners, to be used for executing the pilot sessions foreseen in WP6.
For each of the seven pilots, this document provides: (a) a brief summary of the pilot description prepared in WP6, and especially of the use cases provided in the pilot descriptions; (b) the technical requirements for carrying out these use cases; (c) an architecture that shows the BDI components required to cover these requirements; and (d) the list of components in the architecture and their status (available as part of BDI, otherwise available, or to be developed as part of the pilot).
Abbreviations and Acronyms
BDI
The Big Data Integrator platform that is developed within Big Data Europe
The components that are made available to the pilots by BDI are listed
here httpsgithubcombig-data-europeREADMEwikiComponents
BDI
Instance
A specific deployment of BDI complemented by tools specifically
supporting a given Big Data Europe pilot
BT Bluetooth
ECMWF European Centre for Medium range Weather Forecasting
ESGF Earth System Grid Federation
FCD Floating Car Data
LOD Linked Open Data
SC1 Societal Challenge 1 Health Demographic Change and Wellbeing
SC2 Societal Challenge 2 Food Security Sustainable Agriculture and Forestry
Marine Maritime and Inland Water Research and the Bioeconomy
SC3 Societal Challenge 3 Secure Clean and Efficient Energy
SC4 Societal Challenge 4 Smart Green and Integrated Transport
SC5 Societal Challenge 5 Climate Action Environment Resource Efficiency
and Raw Materials
SC6 Societal Challenge 6 Europe in a changing world ndash Inclusive innovative
and reflective societies
SC7 Societal Challenge 7 Secure societies ndash Protecting freedom and security
of Europe and its citizens
AK Agroknow Belgium
CERTH Centre for Research and Technology Greece
CRES Center for Renewable Energy Sources and Saving Greece
FAO Food and Agriculture Organization of the United Nations Italy
FhG Fraunhofer IAIS Germany
InfAI Institute for Applied Informatics Germany
NCSR-D National Center for Scientific Research ldquoDemokritosrdquo Greece
OPF Open PHACTS Foundation UK
SWC Semantic Web Company Austria
UoA National and Kapodistrian University of Athens
VU Vrije Universiteit Amsterdam the Netherlands
Table of Contents

1 Introduction
1.1 Purpose and Scope
1.2 Methodology
2 Second SC1 Pilot Deployment
2.1 Use Cases
2.2 Requirements
2.3 Architecture
2.4 Deployment
3 Second SC2 Pilot Deployment
3.1 Overview
3.2 Requirements
3.3 Architecture
3.4 Deployment
4 Second SC3 Pilot Deployment
4.1 Overview
4.2 Requirements
4.3 Architecture
4.4 Deployment
5 Second SC4 Pilot Deployment
5.1 Use cases
5.2 Requirements
5.3 Architecture
5.4 Deployment
6 Second SC5 Pilot Deployment
6.1 Use cases
6.2 Requirements
6.3 Architecture
6.4 Deployment
7 Second SC6 Pilot Deployment
7.1 Use cases
7.2 Requirements
7.3 Architecture
7.4 Deployment
8 Second SC7 Pilot Deployment
8.1 Use cases
8.2 Requirements
8.3 Architecture
8.4 Deployment
9 Conclusions
List of Tables

Table 1: Requirements of the Second SC1 Pilot
Table 2: Components needed to deploy the Second SC1 Pilot
Table 3: Requirements of the Second SC2 Pilot
Table 4: Components needed to deploy the Second SC2 Pilot
Table 5: Requirements of the Second SC3 Pilot
Table 6: Components needed to deploy the Second SC3 Pilot
Table 7: Requirements of the Second SC4 Pilot
Table 8: Components needed to deploy the Second SC4 Pilot
Table 9: Requirements of the Second SC5 Pilot
Table 10: Components needed to deploy the Second SC5 Pilot
Table 11: Requirements of the Second SC6 Pilot
Table 12: Components needed to deploy the Second SC6 Pilot
Table 13: Requirements of the Second SC7 Pilot
Table 14: Components needed to deploy the Second SC7 Pilot
List of Figures

Figure 1: Architecture of the Second SC1 Pilot
Figure 2: Architecture of the Second SC2 Pilot
Figure 3: Architecture of the Second SC3 Pilot
Figure 4: Architecture of the Second SC4 Pilot
Figure 5: Architecture of the Second SC5 Pilot
Figure 6: Architecture of the Second SC6 Pilot
Figure 7: Architecture of the Second SC7 Pilot
1 Introduction
1.1 Purpose and Scope
This report documents the instantiations of the Big Data Integrator Platform (BDI) for serving the needs of the domains examined within Big Data Europe. These platform instances will be provided to the relevant networking partners to execute the pilots foreseen in WP6.
1.2 Methodology
Task 5.2 focuses on the application of the generic instantiation methodology in specific use cases pertaining to domains closely related to Europe's Societal Challenges. To this end, T5.2 comprises seven (7) distinct sub-tasks, each one dedicated to a different domain of application.
Participating partners and their roles: NCSR-D (task leader) deploys the different instantiations of the Big Data Integrator Platform and supports the partners carrying out each pilot with consulting about the platform. This task includes two phases: the design phase and the deployment phase. The design phase involves the following:
- Review the pilot descriptions prepared in WP6 and request clarifications where needed, in order to prepare a detailed technical description of the platform that will support the pilot.
- Prepare a first draft of the sections for the second-cycle pilots, where the use cases and workflow from the pilot descriptions are summarized, and the technical requirements and an architecture for each pilot-specific platform are drafted.
- Cooperate with the persons responsible for each pilot to update the pilot description and the technical description in this deliverable, so that they are consistent and satisfactory. This draft also includes a list of components and their availability: (a) base platform components that are prepared in WP4; (b) pilot-specific components that are already available; or (c) pilot-specific components that will be developed for the pilot. Components are also assigned a partner responsible for their implementation.
- Review the pilot technical descriptions from the perspective of bridging between technical work and the community requirements, to establish that the pilot is relevant to the communities it is aimed at.
During the deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.
2 Second SC1 Pilot Deployment
2.1 Use Cases
The pilot is carried out by OPF and VU in the frame of SC1: Health, Demographic Change and Wellbeing.
The pilot demonstrates the workflow of reproducing the functionality of an existing data integration and processing system (the Open PHACTS Discovery Platform) on BDI. The second pilot extends the first pilot (cf. D5.2, Section 2) with the following:
- Discussions with stakeholders and other Societal Challenges will identify how the existing Open PHACTS platform and datasets may potentially be used to answer queries in other domains. In particular, applications in Societal Challenge 2 (food security and sustainable agriculture), where the effects of chemistry (e.g. pesticides) on biology are probed in plants, could exploit the linked data services currently within the OPF platform. This will require discussing use case specifics with SC2 to understand their requirements and ensure that the OPF data is applicable. Similarly, we will explore whether SC2 data that is relevant for early biology research could be linked to the OPF data platform.
- No specific new datasets are targeted for integration in the second pilot. However, if datasets to be made available through other pilots have clear potential links to Open PHACTS datasets, these will be considered for integration into the platform, to offer researchers the ability to pose more complex queries across a wider range of data.
- The second pilot will aim to expand on the first pilot by refreshing the datasets integrated into the pilot. Homogenising and integrating the new data available for these datasets, and developing ways to update datasets by integrating new data on an ongoing basis, will enable new use cases where researchers require fully current datasets for their queries.
- The second pilot will also simplify existing workflows for querying the API, for example with components for common software tools such as KNIME, reducing the barrier for academic institutions and companies to access the platform for knowledge- and data-driven biomedical research use cases.
2.2 Requirements
Table 1 lists the ingestion, storage, processing, and output requirements set by this pilot.

Requirement | Comment
R1: The solution should be packaged in a way such that it is possible to combine the Open PHACTS Docker and the BDE platform to achieve a custom integrated solution | Specificities of the services of the Open PHACTS Discovery Platform should not be hard-wired into the domain-specific instance, but should be read from a configuration file (such as Swagger). The BDE instance should offer or apply these external services over data hosted by the BDE instance.
R2: RDF data storage | The current Open PHACTS Discovery Platform is based on distributed Virtuoso, a proprietary solution. The BDE platform will provide a distributed 4store and SANSA, to be compared with the Open PHACTS Discovery Platform.
R3: Datasets are aligned and linked at data ingestion time, and the transformed data is stored | In conjunction with R1, a modular data ingestion component should dynamically decide which data transformers to invoke.
R4: Data and query security and privacy requirements | A BDI local deployment holds private data and serves private queries. BDE does not foresee any specific technical support for query obfuscation, so remote data sources need to be cloned locally to guarantee query privacy.
Table 1: Requirements of the Second SC1 Pilot
Figure 1: Architecture of the Second SC1 Pilot
2.3 Architecture
To satisfy the requirements above, the following modules will be deployed.
Storage infrastructures:
- Distributed triple store for the data. The second pilot cycle will also test the feasibility of using the SANSA stack [1] as an alternative for SPARQL query processing.
Processing infrastructures:
- Scientific Lenses query expansion.
Other modules:
- Data connector, including the data transformation modules for the alignment of data at ingestion time.
- REST API for querying that builds a SPARQL query by using keywords to fill in pre-defined query templates. The querying service also uses Scientific Lenses to expand queries. A minimal sketch of the template-filling approach is shown below.
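As an illustration only, the following sketch shows how such an endpoint could fill a keyword into a pre-defined SPARQL template and forward the query to the triple store. The route name, template text, and 4store endpoint URL are assumptions made for this sketch, not the pilot's actual code.

```python
from flask import Flask, request, jsonify
from SPARQLWrapper import SPARQLWrapper, JSON

app = Flask(__name__)

# A pre-defined query template; the user-supplied keyword fills the placeholder.
LABEL_SEARCH_TEMPLATE = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?s ?label WHERE {
  ?s rdfs:label ?label .
  FILTER (CONTAINS(LCASE(?label), LCASE("%(keyword)s")))
} LIMIT 100
"""

@app.route("/query")
def query():
    # A production implementation would sanitize the keyword before
    # substituting it into the template.
    keyword = request.args.get("keyword", "")
    sparql = SPARQLWrapper("http://4store:8080/sparql/")  # hypothetical endpoint
    sparql.setQuery(LABEL_SEARCH_TEMPLATE % {"keyword": keyword})
    sparql.setReturnFormat(JSON)
    return jsonify(sparql.query().convert())
```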
2.4 Deployment
Table 2 lists the components provided to the pilot as part of BDI, and components that will be developed within WP6 in the context of executing the pilot.

[1] http://sansa-stack.net/
Module | Task | Responsible
4store | BDI dockers made available by WP4 | NCSR-D
SANSA stack | BDI dockers made available by WP4 | FhG, UniBonn
Data connector and transformation modules | Develop a dynamic transformation engine that uses Swagger descriptions to select the appropriate transformer | VU
Query endpoint | Develop a dynamic query re-write engine that uses Swagger descriptions to select the transformer | VU
Scientific Lenses query expansion module | Needs to be deployed and tested, unless an existing live service will be used for the BDE pilot | VU
Table 2: Components needed to deploy the Second SC1 Pilot
3 Second SC2 Pilot Deployment
3.1 Overview
The pilot is carried out by AK, FAO and SWC in the frame of SC2: Food Security, Sustainable Agriculture and Forestry, Marine, Maritime and Inland Water Research and the Bioeconomy.
The second pilot cycle builds upon the first pilot cycle (cf. D5.2, Section 3), expanding the relevant data sources and extending the data processing needed to handle a variety of data types (apart from bibliographic data) relevant to viticulture.
The pilot demonstrates the following workflows:
1. Text mining workflow: automatically annotating scientific publications by (a) extracting named entities (locations, domain terms) and (b) extracting the captions of images, figures, and tables. The extracted information is provided to viticultural researchers via a GUI that exposes search functionality.
2. Data processing workflow: the end users (viticultural researchers) upload scientific data in a variety of formats and provide the metadata needed in order to correctly interpret the data. The data is ingested and homogenized, so that it can be compared and connected with other relevant data originally in diverse formats. The data is exposed to viticultural researchers via a GUI that exposes search/discovery, aggregation, analysis, correlation, and visualization functionalities over structured data. The results of the data analysis will be stored in the infrastructure, with appropriate provenance, to avoid carrying out the same processing multiple times and for future reference, publication, and scientific replication.
3. Phenological modelling workflow, that is, the scheduling of agricultural operations (e.g. pruning, harvesting) by cross-examining the weather data observed in the area of the vineyard with the appropriate weather conditions needed for the aforementioned operations (a minimal sketch of this cross-examination is given after this list).
4. Variety identification workflow: the end users complete an on-spot questionnaire regarding the characteristics of a specific grape variety. Together with the geolocation of the questionnaire, this information is used to identify a grape variety.
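Purely for illustration, the sketch below shows the kind of cross-examination the phenological modelling workflow performs: forecast weather for the vineyard's area is checked against the conditions each operation requires. Thresholds and field names are invented for this example; the actual AK VITIS model is more elaborate.

```python
# Conditions required by each operation (hypothetical values).
OPERATIONS = {
    "pruning":    {"min_temp_c": 5.0,  "max_temp_c": 20.0, "max_rain_mm": 1.0},
    "harvesting": {"min_temp_c": 10.0, "max_temp_c": 35.0, "max_rain_mm": 0.5},
}

def suitable_days(weather, operation):
    """Return the days on which the operation can be scheduled."""
    c = OPERATIONS[operation]
    return [day["date"] for day in weather
            if c["min_temp_c"] <= day["temp_c"] <= c["max_temp_c"]
            and day["rain_mm"] <= c["max_rain_mm"]]

weather = [  # e.g. retrieved from a weather API for the vineyard's location
    {"date": "2017-03-01", "temp_c": 12.0, "rain_mm": 0.0},
    {"date": "2017-03-02", "temp_c": 4.5,  "rain_mm": 6.2},
]
print(suitable_days(weather, "pruning"))  # ['2017-03-01']
```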
The following datasets will be involved:
- The AGRIS and PubMed datasets, which include scientific publications.
- Weather data available via publicly available APIs, such as AccuWeather, OpenWeatherMap, and Weather Underground.
- User-generated data, such as geotagged photos of leaves, young shoots, and grape clusters, ampelographic data, and SSR-marker data that will be provided by the VITIS application.
- The OIV Descriptor List [2] for Grape Varieties and Vitis species.
- The Crop Ontology.
The following processing is carried out:
- Named entity extraction
- Researcher affiliation extraction and verification
- Variety identification
- Phenological modelling
- PDF structure processing, to associate tables and diagrams with captions
- Data ingestion and homogenization
- Aggregation, analysis, and correlation over scientific data
The following outputs are made available for visualization or further processing:
- Structured information and topics extracted from scientific publications
- Metadata for dataset searching and discovery
- Aggregation, analysis, and correlation results
3.2 Requirements
Table 3 lists the ingestion, storage, processing, and output requirements set by this pilot.

Requirement | Comment
R1: In case of failures during data processing, users should be able to recover data produced earlier in the workflow, along with their lineage and associated metadata | Processing modules should periodically store intermediate results and their lineage metadata. When starting up, processing modules should check at the metadata registry if intermediate results are available.
R2: Extracting images and their captions from scientific publications | To be developed for the pilot, taking into account R1.
R3: Extracting thematic annotations from text in scientific publications | To be developed for the pilot, taking into account R1.
R4: Extracting researcher affiliations from the scientific publications | To be developed for the pilot, taking into account R1.
R5: Variety identification | To be developed for the pilot, taking into account R1.
R6: Phenological modelling | To be developed for the pilot, taking into account R1.
R7: Expose data and metadata in JSON through a Web API | The data ingestion module should write JSON documents in HDFS; 4store should be accessed via a SPARQL endpoint that responds with results in JSON.
Table 3: Requirements of the Second SC2 Pilot

[2] http://www.oiv.int/en
Figure 2: Architecture of the Second SC2 Pilot
3.3 Architecture
To satisfy the requirements above, the following modules will be deployed.
Storage infrastructures:
- HDFS for storing publication full-text and ingested datasets.
- A graph database for storing publication metadata (terms and named entities), affiliation metadata (connections between researchers), weather metadata, and VITIS metadata.
Processing infrastructures:
- Metadata extraction: Spark or UnifiedViews [3] are used to extract RDF metadata from publication full-text. These tools will react on Kafka messages. Spark and UnifiedViews will be evaluated for this task.
- PoolParty: a SKOS thesaurus [4] will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g. bio terms, locations, and other named entities). For this step the SWC PoolParty Semantic Suite [5] will be used. Additional enrichment of the dataset will be explored, e.g. via linking to DBpedia or other LOD sources.
- AKSTEM: the process of discovering relations and associations between organizations and people in the field of viticulture research.
- Phenological modelling: an algorithm already developed in AK VITIS, to be adapted to work in the context of an Apache Spark application.
- Variety identification: already developed in AK VITIS, to be adapted to work in the context of an Apache Spark application.
- Extraction of images and figures, and their captions, from publication PDFs.
- Data analysis, which writes analysis results back into the infrastructure to be retrieved for visualization. Data analysis should accompany each write-back with appropriate metadata that specifies the processing lineage of the derived dataset. Intermediate results should also be written out (and described as such in the metadata) in order to allow resuming processing after a failure.
Other modules:
- Flume for publication ingestion. For every source that will be ingested into the system, there will be a Flume agent responsible for data ingestion and basic modification/unification.
- Kafka: as soon as a new record is available, a Kafka message will be produced. One Kafka consumer stores raw data into HDFS; a minimal sketch of such a consumer is shown below.
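The sketch below illustrates such a consumer using the kafka-python and hdfs (WebHDFS) client libraries; the broker address, topic name, and HDFS layout are assumptions made for illustration, not the pilot's actual configuration.

```python
import uuid
from kafka import KafkaConsumer
from hdfs import InsecureClient

consumer = KafkaConsumer("ingested-records",            # hypothetical topic
                         bootstrap_servers="kafka:9092",
                         group_id="hdfs-writer")
hdfs = InsecureClient("http://namenode:50070", user="hdfs")

for message in consumer:
    # One file per record for simplicity; a production consumer would
    # batch records into larger HDFS files.
    path = "/data/raw/%s.json" % uuid.uuid4()
    hdfs.write(path, data=message.value, overwrite=False)
```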
3.4 Deployment
Table 4 lists the components provided to the pilot as part of BDI, and components that will be developed within WP6 in the context of executing the pilot.

Module | Task | Responsible
Spark over HDFS, Flume, Kafka | BDI dockers made available by WP4 | FhG, TF, InfAI, SWC
GraphDB and/or Neo4j dockerization | To be investigated whether the Docker images provided by the official systems [6] are suitable for the pilot. If not, they will be altered for the pilot, or an already dockerized triple store such as Virtuoso or 4store will be used. | SWC
Flume agents for publication ingestion and processing | To be developed for the pilot | SWC
Flume agents for data ingestion | To be extended for the pilot in order to support the introduced datasets (AccuWeather data, user-generated data) | SWC, AK
Data storage schema | To be developed for the pilot | SWC, AK
Phenological modelling | To be adapted from AK VITIS for the pilot | AK
Spark AKSTEM | To be adapted from AK STEM for the pilot | AK
Variety identification | To be adapted from AK VITIS for the pilot | AK
Table 4: Components needed to deploy the Second SC2 Pilot

[3] Cf. http://www.unifiedviews.eu
[4] Cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System
[5] Cf. http://www.poolparty.biz
[6] https://neo4j.com/developer/docker
4 Second SC3 Pilot Deployment
4.1 Overview
The pilot is carried out by CRES in the frame of SC3: Secure, Clean and Efficient Energy.
The second pilot cycle extends the first pilot by adding further online and offline analysis of raw data from Acoustic Emissions (AE) sensors and of aggregated data, such as parametrics from continuous monitoring systems (CMS). The pilot demonstrates the following workflow: a developer in the field of wind energy enhances condition monitoring for each unit in a wind farm by pooling together data from multiple units of the same farm (to consider the cluster operation in total) and third-party data (to perform correlated assessment). The custom analysis modules created by the developer use both raw data that is transferred offline to the processing cluster and condensed data streamed online, in the same time order in which the events occur.
The following datasets are involved:
- Raw sensor and SCADA data from a given wind farm
- Online stream data comprised of parametrics and statistics extracted from the raw SCADA data
- Raw sensor data from the Acoustic Emissions module of a given wind farm
All data is in custom binary or ASCII formats. ASCII files contain a metadata header and the signal data in tabulated form (signals in columns, time sequence in rows). All data is annotated by location, time, and system id.
The following processing is carried out:
- Near-real-time execution of parametrized models to return operational statistics and warnings, including correlation analysis of data across units
- Weekly execution of operational statistics
- Weekly execution of model parametrization
- Weekly specific acoustic emissions DSP
The following outputs are made available for visualization or further processing:
- Operational statistics, near-real-time and weekly
- Model parameters
4.2 Requirements
Table 5 lists the ingestion, storage, processing, and output requirements set by this pilot. Since the second cycle of the pilot extends the first pilot, some requirements are identical and therefore omitted from Table 5.

Requirement | Comment
R1: The online data will be sent (via OPC) from the intermediate (local) processing level to BDI | A data connector must be developed that provides for receiving OPC streams from an OPC-compatible server.
R2: The application should be able to recover from short outages by collecting the data transmitted during the outage from the data sources | An OPC data connector must be developed that can retrieve the missing data collected at the intermediate level from the distributed data historian systems.
R3: Near-real-time execution of parametrized models to return operational statistics, including correlation analysis of data across units | The analysis software should write its results back in a specified format and data model that is appropriate input for further analysis.
R4: The GUI supports database querying and data visualization for the analytics results | The GUI will be able to access files in the specified format and data model.
Table 5: Requirements of the Second SC3 Pilot
Figure 3: Architecture of the Second SC3 Pilot
4.3 Architecture
To satisfy the requirements above, the following modules will be deployed.
Storage infrastructures:
- HDFS, which stores binary blobs, each holding a temporal slice of the complete data. The slicing parameters are fixed and can be applied at data ingestion time.
- A Postgres relational database to store the warnings, operational statistics, and the output of the analysis. The schema will be defined at a later stage.
- A Kafka broker that will distribute the continuous stream of CMS data to model execution.
Processing infrastructures:
- A processor that operates upon temporal slices of data.
- A Spark module that orchestrates the application of the processor on slices.
- A Spark streaming module that operates on the online data (a minimal sketch is given after this list).
Other modules:
- A data connector that offers an ingestion endpoint and/or can retrieve from remote data sources using the FTP protocol.
- A data connector that offers an ingestion endpoint that can retrieve an online stream using the OPC protocol and publish it to a Kafka topic.
- Data visualization that can visualize the data files stored in HDFS.
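The sketch below illustrates, under stated assumptions, what the Spark streaming module operating on the online CMS stream could look like: it consumes the Kafka topic fed by the OPC data connector and computes sliding-window statistics per unit and parameter. The topic name, broker address, and message format are invented for this example.

```python
import json
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # Spark 1.x/2.x streaming API

sc = SparkContext(appName="cms-online-stats")
ssc = StreamingContext(sc, batchDuration=10)

stream = KafkaUtils.createDirectStream(
    ssc, ["cms-parametrics"], {"metadata.broker.list": "kafka:9092"})

# Each message is assumed to be a JSON object such as
# {"unit": "WT03", "parameter": "gearbox_temp", "value": 61.2}
values = stream.map(lambda kv: json.loads(kv[1])) \
               .map(lambda m: ((m["unit"], m["parameter"]), (m["value"], 1)))

# Sliding 5-minute mean per unit/parameter, recomputed every minute.
sums = values.reduceByKeyAndWindow(
    lambda a, b: (a[0] + b[0], a[1] + b[1]), None, 300, 60)
means = sums.mapValues(lambda s: s[0] / s[1])
means.pprint()

ssc.start()
ssc.awaitTermination()
```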
4.4 Deployment
Table 6 lists the components provided to the pilot as part of BDI, and components that will be developed within WP6 in the context of executing the pilot.

Module | Task | Responsible
Spark, HDFS, Postgres, Kafka | BDI dockers made available by WP4 | FhG, TF, InfAI, NCSR-D, SWC
Acoustic Emissions DSP | To be developed for the pilot | CRES
OPC data connector | To be developed for the pilot | CRES
Data visualization | To be extended for the pilot | CRES
Table 6: Components needed to deploy the Second SC3 Pilot
5 Second SC4 Pilot Deployment
5.1 Use cases
The pilot is carried out by FhG and CERTH in the frame of SC4: Smart, Green and Integrated Transport.
The pilot demonstrates how to implement the workflow for ingesting, processing, and storing stream and historical traffic data in a distributed environment. The pilot demonstrates the following workflows:
- The map matching of the Floating Car Data (FCD) stream that is generated by the taxi fleet. The FCD data, which represents the position of cabs using latitude and longitude coordinates, must be map-matched to the roads on which the cabs are driving, in order to infer the traffic conditions of the roads. The map matching is done through an algorithm using a geographical database and topological rules.
- The monitoring of the current traffic conditions, which consumes the map-matched FCD data and infers the traffic conditions of the roads.
- The forecasting of future traffic conditions, based on a model that is trained from historical and real-time map-matched FCD data.
The second pilot is based upon the processing modules developed in the first pilot (cf. D5.2, Section 5), namely the processing modules developed by CERTH to analyze traffic data and classify traffic conditions. The second pilot will also develop the newly added workflow of traffic forecasting and model training, which did not exist during the first pilot cycle.
The data sources available for the pilot are:
- A near-real-time stream of Floating Car Data (FCD), generated by a fleet of 1200 taxis, containing information about the position, speed, and direction of the cabs
- A historical database of recorded FCD data
- A geographical database with information about the road network in Thessaloniki
The results of traffic monitoring and traffic forecasting are saved into a database for querying, statistics, and visualizations.
5.2 Requirements
Table 7 lists the ingestion, storage, processing, and output requirements set by this pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 7 lists only the new requirements.

Requirement | Comment
R1: The pilot will enable the evaluation of present and future traffic conditions (e.g. congestion) within temporal windows | The map-matched FCD data are used to determine the current traffic condition and to make predictions within different time windows.
R2: The traffic predictions will be saved in a database | Traffic conditions and predictions will be used for queries, statistics, evaluation of the quality of predictions, and visualizations.
R3: The pilot can be started in two configurations: single node (for development and testing) and cluster (production) | It must be possible to run all the pilot components in one single node for development and testing purposes. The cluster configuration must support clustering of all components: messaging system (Kafka), processing modules (Flink, Spark, TensorFlow), and storage (Postgres).
Table 7: Requirements of the Second SC4 Pilot
Figure 4: Architecture of the Second SC4 Pilot
5.3 Architecture
The architecture of the pilot has been designed taking into consideration the data sources (mostly streams), the processing steps needed, and the information that needs to be computed. The pilot will ingest data from a near-real-time FCD data stream from cabs and from historical FCD data. The FCD data needs to be preprocessed for map matching before being used for classification/prediction.
Apache Kafka will be used to distribute the computations, as it provides a scalable, fault-tolerant messaging system. The processing of the data streams will be performed within temporal windows. Apache Flink will be used for the map matching algorithm, in the same manner as in the first cycle of the pilot. Apache Spark or TensorFlow will be considered as a platform to implement the traffic forecasting algorithm.
The algorithms used for the map matching and classification will be provided using R, as it provides good support for machine learning algorithms and because it is commonly used and well known by researchers at CERTH. In order to use the R packages in a Flink application developed in Java, the pilot will connect to an R server (via Rserve). Recurrent Neural Networks will be used for the traffic forecasting module.
The traffic conditions and prediction computations will be stored in a scalable, fault-tolerant database such as Elasticsearch. The storage system must support spatial and temporal indexing; a minimal sketch of such a storage scheme is shown below.
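As a hedged sketch of what such a storage scheme could look like, the snippet below creates an Elasticsearch index whose geo_point and date fields provide the required spatial and temporal indexing, and stores one prediction. The index name, mapping, and field names are illustrative assumptions, not the pilot's actual schema.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://elasticsearch:9200"])

# geo_point and date fields give Elasticsearch the spatial/temporal indexes
# needed to query predictions by road location and time window.
es.indices.create(index="traffic-predictions", body={
    "mappings": {
        "prediction": {  # mapping types, as in 2017-era Elasticsearch
            "properties": {
                "road_id":          {"type": "keyword"},
                "location":         {"type": "geo_point"},
                "timestamp":        {"type": "date"},
                "horizon_min":      {"type": "integer"},
                "congestion_level": {"type": "float"},
            }
        }
    }
}, ignore=400)  # ignore "index already exists"

es.index(index="traffic-predictions", doc_type="prediction", body={
    "road_id": "thess-ring-42",                    # hypothetical road segment
    "location": {"lat": 40.6401, "lon": 22.9444},
    "timestamp": "2017-02-23T08:15:00",
    "horizon_min": 15,                             # prediction horizon (minutes)
    "congestion_level": 0.73,
})
```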
5.4 Deployment
Table 8 lists the components provided to the pilot as part of BDI, and components that will be developed within WP6 in the context of executing the pilot.

Module | Task | Responsible
PostGIS, Elasticsearch, Kafka, Flink, Spark, TensorFlow | BDI dockers made available by WP4 | NCSR-D, SWC, TF, FhG
A Kafka producer for the FCD data stream (source: URL) and historical data (source: file system) | Develop a Kafka producer to collect the FCD data as a stream from web services and from the file system for the historical data sets, and send them to a Kafka topic | FhG
Kafka brokers | Install Kafka to provide a message broker and the topics | SWC
A Spark application for traffic forecasting and model training | Develop a Spark application that consumes map-matched FCD data from a Kafka topic. The application will train a prediction model and write the traffic predictions to Elasticsearch. | FhG
A Kafka consumer for storing analysis results | Develop a Kafka consumer that stores the results of the traffic classification and prediction modules | FhG
Table 8: Components needed to deploy the Second SC4 Pilot
6 Second SC5 Pilot Deployment
6.1 Use cases
The pilot is carried out by NCSR-D in the frame of SC5: Climate Action, Environment, Resource Efficiency and Raw Materials.
The pilot demonstrates the following workflow: a (potentially hazardous) substance is released into the atmosphere, resulting in increased readings in one or more monitoring stations. The user accesses a user interface provided by the pilot to define the locations of the monitoring stations, as well as a timeseries of the measured values (e.g. gamma dose rate). The platform initiates:
- a weather matching algorithm, that is, a search for similarity between the current weather and the pre-computed weather patterns, as well as
- a dispersion matching algorithm, that is, a search for similarity between the current substance dispersion and the precomputed dispersion patterns.
The weather patterns have been extracted in a pre-processing step by clustering weather conditions recorded in the past, while the substance dispersion patterns have been precomputed by simulating different scenarios of substance release and weather conditions. The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request.
The following datasets are involved:
- NetCDF files from the European Centre for Medium-Range Weather Forecasts (ECMWF) [7]
- GRIB files from the National Oceanic and Atmospheric Administration (NOAA) [8]
The following processing will be carried out:
- The weather clustering algorithm that creates clusters of similar weather conditions, implemented using the BDI platform (see Section 6.3).
- The WRF downscaling, which takes as input a low-resolution weather field and creates a high-resolution one.
- The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model, which computes dispersion patterns given predominant weather conditions.
The following outputs are made available for visualization or further processing:
- The dispersions produced by DIPCOT
- The weather clusters produced by the weather clustering algorithm

[7] http://apps.ecmwf.int/datasets
[8] https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/global-forcast-system-gfs
6.2 Requirements
Table 9 lists the ingestion, storage, processing, and output requirements set by this pilot.

Requirement | Comment
R1: Provide a means of downloading current/evaluation weather from ECMWF or alternative services | A data connector/interface needs to be developed.
R2: ECMWF and NOAA datasets are compatible with the WRF and DIPCOT naming conventions | A preprocessing WPS normalization step will perform the necessary transformations and variable renamings needed to ensure compatibility.
R3: Retrieve NetCDF files from HDFS as input to the weather clustering algorithm |
R4: Dispersion matching will filter on dispersion values | A relational database will provide indexes on dispersion values, for efficient dispersion search.
R5: Dispersion visualization | Weather and dispersion matching must produce output compatible with Sextant's input, or Sextant must be modified to support new input.
Table 9: Requirements of the Second SC5 Pilot
Figure 5: Architecture of the Second SC5 Pilot
6.3 Architecture
To satisfy the requirements described above, the following components will be deployed.
Storage infrastructure:
- HDFS for storing NetCDF and GRIB files.
- Postgres for storing dispersions.
Processing components:
- Scikit-learn or TensorFlow to host the weather clustering algorithm (a minimal sketch is given after this list).
Other modules:
- ECMWF and NOAA data connectors.
- WPS normalization procedure.
- WRF downscaling component.
- DIPCOT atmospheric dispersion model.
- Weather and dispersion matching.
- Sextant for visualizing the dispersion layer.
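Purely as an illustration of the clustering step, the sketch below flattens past weather fields into one vector per timestep and clusters them with k-means in scikit-learn; the file name, variable names, and number of clusters are assumptions for this example, not the pilot's actual configuration.

```python
import numpy as np
from netCDF4 import Dataset
from sklearn.cluster import KMeans

# Hypothetical ECMWF extract with (time, lat, lon) wind-component fields.
nc = Dataset("ecmwf_past_weather.nc")
u = np.asarray(nc.variables["u10"][:])
v = np.asarray(nc.variables["v10"][:])

# One sample per timestep: the concatenated, flattened u/v fields.
samples = np.concatenate(
    [u.reshape(u.shape[0], -1), v.reshape(v.shape[0], -1)], axis=1)

# The cluster centroids play the role of the pre-computed weather patterns.
kmeans = KMeans(n_clusters=8, random_state=0).fit(samples)

# Weather matching then reduces to assigning the current weather to a pattern.
current = samples[-1].reshape(1, -1)
print("current weather matches pattern", kmeans.predict(current)[0])
```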
6.4 Deployment
Table 10 lists the components provided to the pilot as part of BDI, and components that will be developed within WP6 in the context of executing the pilot.

Module | Task | Responsible
HDFS, Sextant, Postgres | BDI dockers made available by WP4 | TF, UoA, NCSR-D
Scikit-learn, TensorFlow | To be developed in the pilot | NCSR-D
DIPCOT | To be packaged in the pilot | NCSR-D
Weather clustering algorithm | To be developed in the pilot | NCSR-D
Weather matching | To be developed in the pilot | NCSR-D
Dispersion matching | To be developed in the pilot | NCSR-D
ECMWF and NOAA data connector | To be developed in the pilot | NCSR-D
Data visualization UI | To be developed in the pilot | NCSR-D
Table 10: Components needed to deploy the Second SC5 Pilot
7 Second SC6 Pilot Deployment
7.1 Use cases
The pilot is carried out by NCSR-D and SWC in the frame of SC6: Europe in a changing world – Inclusive, innovative and reflective societies.
The pilot demonstrates the following workflow: municipality economic data (i.e. budget and budget execution data) are ingested on a regular basis (daily, weekly, and so on) from a series of locations, in a variety of structures and formats, and are homogenized so that they can be compared, analyzed, and visualized in a comprehensible way. The data is exposed to users via a dashboard that offers search/discovery, aggregation, analysis, correlation, and visualization functionalities over structured data. The results of the data analysis will be stored in the infrastructure, to avoid carrying out the same processing multiple times.
The second cycle of the pilot will extend the first pilot by incorporating different formats, through the development of a modular parsing library.
The following datasets are involved:
- Budget execution data of the Municipality of Athens
- Budget execution data of the Municipality of Thessaloniki
- Budget execution data of the Municipality of Barcelona
The current datasets involved are exposed either as an API or as CSV/XML files.
Datasets will be described by DCAT-AP [9] metadata and the FIBO [10] and FIGI [11] ontologies. Statistical data will be described in the RDF Data Cube [12] vocabulary.
The following processing is carried out:
- Data ingestion and homogenization
- Aggregation, analysis, and correlation over the budget data
The following outputs are made available for visualization or further processing:
- Structured information extracted from budget datasets, exposed as a SPARQL endpoint
- Metadata for dataset searching and discovery
- Aggregation and analysis results

[9] Cf. https://joinup.ec.europa.eu/asset/dcat_application_profile/description
[10] Cf. http://www.omg.org/spec/EDMC-FIBO/FND/1.0/Beta1/index.htm
[11] Cf. http://www.omg.org/hot-topics/finance.htm
[12] Cf. https://www.w3.org/TR/2014/REC-vocab-data-cube-20140116/
7.2 Requirements
Table 11 lists the ingestion, storage, processing, and output requirements set by this pilot.

Requirement | Comment
R1: In case of failures during data processing, users should be able to recover data produced earlier in the workflow, along with their lineage and associated metadata | Processing modules should periodically store intermediate results. When starting up, processing modules should check at the metadata registry if intermediate results are available.
R2: Transform budget data into a homogenized format using various parsers | Parsers will be developed for the pilot, taking into account R1.
R3: Expose data and metadata through a SPARQL endpoint | The triple store should be accessed via a SPARQL endpoint.
R4: Intuitive, easy-to-use interface for searching and selecting relevant data sources. The use of the user interface should be documented, so that users can ease into using it with as little effort as possible. | The GraphSearch UI will be used to create visualizations from SPARQL queries.
Table 11: Requirements of the Second SC6 Pilot
Figure 6: Architecture of the Second SC6 Pilot
7.3 Architecture
To satisfy the requirements above, the following modules will be deployed.
Storage infrastructures:
- HDFS for storing ingested datasets.
- 4store for storing homogenized statistical data and dataset metadata.
Processing infrastructures:
- Metadata extraction: Spark is used to extract RDF data and metadata from budget data, reacting on Kafka messages.
- PoolParty: a SKOS thesaurus [13] will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g. locations and other named entities). For this step the SWC PoolParty Semantic Suite [14] will be used as an external service. PoolParty is accessible from the BDE components via an HTTP API; the connection between Spark and PoolParty has been implemented in the first pilot cycle. Additional enrichment of the dataset will be explored, e.g. via linking to DBpedia or other LOD sources.
- Data analysis, which will be performed on demand by pre-defined queries in the dashboard.
Other modules:
- Flume for dataset ingestion. For every source that will be ingested into the system, there will be a Flume agent responsible for data ingestion and basic modification/unification.
- Kafka: as soon as a new record is available, a Kafka message will be produced. One Kafka consumer stores raw data into HDFS.
- A set of pre-defined SPARQL queries that carry out analytical aggregations, important comparisons, and/or other analysis of the data (a minimal sketch is given after this list).
- GUI that provides functionality for (a) metadata searching, to discover datasets, data, and publications, and (b) linked data browsing (i.e. dereferencing entity descriptions in RDF), in the form of a visual dashboard realised in d3.js [15].
- GraphSearch as the user interface.
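As an illustration of such a pre-defined query, the sketch below sums budget amounts per municipality over a SPARQL endpoint, assuming observations modelled with the RDF Data Cube vocabulary; the property names and endpoint URL are invented for this example, not the pilot's actual schema.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

TOTAL_PER_MUNICIPALITY = """
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX ex: <http://example.org/budget#>
SELECT ?municipality (SUM(?amount) AS ?total) WHERE {
  ?obs a qb:Observation ;
       ex:municipality ?municipality ;
       ex:amount ?amount .
} GROUP BY ?municipality ORDER BY DESC(?total)
"""

sparql = SPARQLWrapper("http://4store:8080/sparql/")  # hypothetical endpoint
sparql.setQuery(TOTAL_PER_MUNICIPALITY)
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["municipality"]["value"], row["total"]["value"])
```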
7.4 Deployment
Table 12 lists the components provided to the pilot as part of BDI, and components that will be developed within WP6 in the context of executing the pilot.

Module | Task | Responsible
Spark over HDFS, 4store, Flume, Kafka | BDI dockers made available by WP4 | FhG, TF, InfAI, NCSR-D, SWC
Data storage schema | To be extended for the pilot | SWC
Metadata extraction | Parsers for different data sources will be developed for the pilot | SWC
GraphSearch GUI | To be configured for the pilot | SWC
Table 12: Components needed to deploy the Second SC6 Pilot

[13] Cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System
[14] Cf. http://www.poolparty.biz
[15] Cf. https://d3js.org
8 Second SC7 Pilot Deployment
8.1 Use cases
The pilot is carried out by SatCen, UoA and NCSR-D in the frame of SC7: Secure societies – Protecting freedom and security of Europe and its citizens.
The pilot demonstrates the following workflows:
1. Event detection workflow: news sites and social media are monitored and processed in order to extract and localize information about events. Events are categorized and the information in them is extracted; the end user is notified about the area concerned by the news, and can visualize the event information together with the changes detected by the other workflow (if activated).
2. Change detection workflow: the end user selects a relevant Area of Interest. With respect to the selected dates, two satellite images (earliest and latest) of these areas are downloaded from the ESA Sentinels Scientific Data Hub and processed in order to detect changes. The end user is notified about detected changes and can view the images and event information about this area.
The second cycle of the SC7 pilot will extend the functionality and improve the performance of the first cycle of the pilot (cf. D5.2, Section 8).
Apart from the datasets used in the first cycle of the pilot, this cycle will also use the keyword-based Twitter API to retrieve tweets based on pre-defined keywords. To further support the keyword-based search, the second cycle of the pilot will also include a full-text indexing engine.
The following outputs are made available for visualization or further processing:
- Relevant news related to specific keywords, together with the corresponding Area of Interest
- Detected changes
Moreover, the event detection workflow will be extended in order to automatically activate the change detection workflow. These changes are depicted in the updated architecture diagram in Figure 7.
8.2 Requirements
Table 13 lists the ingestion, storage, processing, and output requirements set by the second cycle of the pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 13 lists only the new requirements.

Requirement | Comment
R1: Monitor keyword-based text services (Twitter). Text is retrieved and stored together with provenance and any metadata provided by the service (notably location). | The NOMAD data connectors to Twitter and Reuters will be adapted to access the keyword search API of Twitter and store to Cassandra.
R2: Regularly execute event detection using Spark over the most recent text batch | Event detection is part of the ingestion process and adds annotations to the text data, not part of the distributed processing.
R3: Improve the speed of the change detection workflow | Optimize the scalability of the operators developed in Apache Spark for the change detection workflow.
R4: Extend the change detection workflow to improve accuracy | Fundamental SNAP operators (e.g. Subset and Terrain Correction) for Sentinel 1 will be adapted to Apache Spark.
R5: Areas of Interest are automatically defined by event detection | The Sentinel data connector is parametrized from the event detection module with a GIS shape.
R6: End-user interface is based on Sextant | Improvement of Sextant functionalities to improve the user experience.
R7: Users must be authenticated and authorized to access the pilot data | Sextant will be extended in order to support authentication and authorization.
Table 13: Requirements of the Second SC7 Pilot
Figure 7: Architecture of the Second SC7 Pilot
8.3 Architecture
To satisfy the requirements above, the following modules will be deployed.
Storage infrastructures:
- HDFS for storing satellite images.
- Cassandra for storing news and tweet content and metadata.
- Lucene for storing the GADM dataset, i.e. the administrative areas together with their geo-locations.
- Strabon for storing geo-locations of detected changes, and location metadata about news and tweets.
Processing infrastructures:
- Spark will be made available for improving the change detection module and developing the event detection module.
Data integration:
- Semagrow will federate Strabon and Cassandra to provide the user interface with homogeneous access to both data stores.
Other modules:
- Twitter data connector (a minimal sketch is given after this list).
- Reuters RSS feed reader.
- The Sentinel Data Aggregator, which receives as input the set of areas of interest and submits a suitable query to the Sentinels Scientific Data Hub.
- Sextant as the user interface.
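Purely as a sketch of the keyword-based connector, the snippet below retrieves tweets for pre-defined keywords with tweepy and stores them, with simple provenance and location metadata, into Cassandra via the DataStax driver. The keyspace, table, keyword list, and credentials are placeholders, not the pilot's actual configuration.

```python
import tweepy
from cassandra.cluster import Cluster

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

session = Cluster(["cassandra"]).connect("sc7_pilot")  # hypothetical keyspace
insert = session.prepare(
    "INSERT INTO tweets (id, created_at, text, source, geo) "
    "VALUES (?, ?, ?, ?, ?)")

for keyword in ["earthquake", "flood"]:                # pre-defined keywords
    for tweet in api.search(q=keyword, count=100):
        # Keep the service and matched keyword as provenance, plus the
        # coordinates Twitter provides (when available).
        coords = tweet.coordinates["coordinates"] if tweet.coordinates else None
        session.execute(insert, (tweet.id, tweet.created_at, tweet.text,
                                 "twitter:" + keyword, str(coords)))
```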
8.4 Deployment
Table 14 lists the components provided to the pilot as part of BDI [16], and components that will be developed within WP6 in the context of executing the pilot.

Module | Task | Responsible
Big Data Integrator: HDFS/Hadoop, Cassandra, Spark, Semagrow, Strabon, SOLR | BDI dockers made available by WP4 | FhG, TF, InfAI, NCSR-D, UoA, SWC
Cassandra and Strabon stores | The schema needs to be altered to support tweets by keyword | NCSR-D and UoA
Change detection module | Spark code to be developed for extending and improving the change detection algorithm | UoA
Event detection module | Spark code to be developed to scale the event detection algorithm | NCSR-D
Twitter data connector | To be extended to access the keyword search Twitter API | NCSR-D
User interface | To be enhanced for the pilot | UoA
Table 14: Components needed to deploy the Second SC7 Pilot

[16] Cf. https://github.com/big-data-europe/README/wiki/Components
9 Conclusions
This report analyses the pilot requirements and specifies the components of the generic Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting round. The relevant work in this task is to ensure that the components are within the scope of what is prepared in WP4, and that they interoperate and can be used in the same application.
All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure and provided to the piloting partners as a basis for their piloting applications, which will be developed in WP6. As a result of this preliminary testing and the interaction between the technical partners and the piloting partners, some of the original pilot descriptions have been refined and fully specified, and their usage of BDI components has been clarified. This ensures that the pilot descriptions are consistent with the first public release of the BDI platform (D4.2) and can be reproduced by interested third parties.
Work in this task (Task 5.2) will proceed as follows:
- During the second pilot deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.
- During the third pilot deployment phase, work in this task will prepare the next version of this document, regarding the BDI instances needed for the third piloting round.
D54 ndash v 100
Page
2
History
Version Date Reason Revised by
000 21012016 Document structure S Konstantopoulos
001 2122016 First draft based on descriptions of
second piloting cycle
A Charalambidis
S Konstantopoulos G
Stavrinos and I
Klampanos
002 1212017 Second draft based on the
feedback of the piloting partners
A Charalambidis
S Konstantopoulos
and V Karkaletsis
003 1212017 Corrections and comments for SC2
pilot
P Karampiperis
and P Zervas
004 1312017 Corrections and comments for SC7
pilot
G Papadakis
M Lazarrini
005 2212017 Corrections for SC1 pilot B Williams-Jones
K McNeice
006 2512017 SC3 pilot architecture A Charalambidis and I
Mouchakis
007 122017 Corrections and clarifications for
SC3
A Charalambidis and F
Mouzakis
008 922017 Peer review A Versteden and H
Jabeen
009 1622017 Address peer review comments
A Charalambidis
S Konstantopoulos
I Klampanos
J Jakobitsch and
K McNeice
100 2322017 Final version
D54 ndash v 100
Page
3
Author List
Organisation Name Contact Information
NCSR-D Ioannis Mouchakis gmouchakisiitdemokritosgr
NCSR-D Stasinos Konstantopoulos konstantiitdemokritosgr
NCSR-D Angelos Charalambidis acharaliitdemokritosgr
NCSR-D Georgios Stavrinos gstavrinosiitdemokritosgr
NCSR-D Vangelis Karkaletsis vangelisiitdemokritosgr
Agroknow Pythagoras Karampiperis pythkagroknowcom
Agroknow Panagiotis Zervas pzervasagroknowcom
Open
PHACTS Bryn Williams-Jones brynopenphactsfoundationorg
Open
PHACTS Kiera McNeice kieraopenphactsfoundationorg
CRES Fragkiskos Mouzakis mouzakiscresgr
TenForce Aad Versteden aadverstedentenforcecom
UoB Hajira Jabeen hajirajabeengmailcom
SWC Juumlrgen Jakobitsch jjakobitschsemantic-webat
D54 ndash v 100
Page
4
Executive Summary
This report documents the instantiations of the Big Data Integrator Platform that underlies the
pilot applications that will be prepared in WP6 for serving exemplary use cases of the Horizon
2020 Societal Challenges These platform instances will be provided to the relevant networking
partners to be used for executing the pilot sessions foreseen in WP6
For each of the seven pilots this document provides (a) a brief summary of the pilot description
prepared in WP6 and especially of the use cases provided in the pilot descriptions (b) the
technical requirements for carrying out these use cases (c) an architecture that shows the BDI
components required to cover these requirements and (d) the list of components in the
architecture and their status (available as part of BDI or otherwise available or to be
developed as part of the pilot)
D54 ndash v 100
Page
5
Abbreviations and Acronyms
BDI
The Big Data Integrator platform that is developed within Big Data Europe
The components that are made available to the pilots by BDI are listed
here httpsgithubcombig-data-europeREADMEwikiComponents
BDI
Instance
A specific deployment of BDI complemented by tools specifically
supporting a given Big Data Europe pilot
BT Bluetooth
ECMWF European Centre for Medium range Weather Forecasting
ESGF Earth System Grid Federation
FCD Floating Car Data
LOD Linked Open Data
SC1 Societal Challenge 1 Health Demographic Change and Wellbeing
SC2 Societal Challenge 2 Food Security Sustainable Agriculture and Forestry
Marine Maritime and Inland Water Research and the Bioeconomy
SC3 Societal Challenge 3 Secure Clean and Efficient Energy
SC4 Societal Challenge 4 Smart Green and Integrated Transport
SC5 Societal Challenge 5 Climate Action Environment Resource Efficiency
and Raw Materials
SC6 Societal Challenge 6 Europe in a changing world ndash Inclusive innovative
and reflective societies
SC7 Societal Challenge 7 Secure societies ndash Protecting freedom and security
of Europe and its citizens
AK Agroknow Belgium
CERTH Centre for Research and Technology Greece
CRES Center for Renewable Energy Sources and Saving Greece
FAO Food and Agriculture Organization of the United Nations Italy
FhG Fraunhofer IAIS Germany
InfAI Institute for Applied Informatics Germany
NCSR-D National Center for Scientific Research ldquoDemokritosrdquo Greece
OPF Open PHACTS Foundation UK
SWC Semantic Web Company Austria
UoA National and Kapodistrian University of Athens
VU Vrije Universiteit Amsterdam the Netherlands
D54 ndash v 100
Page
6
Table of Contents 1 Introduction 9
11 Purpose and Scope 9
12 Methodology 9
2 Second SC1 Pilot Deployment 10
21 Use Cases 10
22 Requirements 10
23 Architecture 12
24 Deployment 12
3 Second SC2 Pilot Deployment 14
31 Overview 14
32 Requirements 15
33 Architecture 17
34 Deployment 18
4 Second SC3 Pilot Deployment 20
41 Overview 20
42 Requirements 21
43 Architecture 22
44 Deployment 23
5 Second SC4 Pilot Deployment 24
51 Use cases 24
52 Requirements 24
53 Architecture 26
54 Deployment 27
6 Second SC5 Pilot Deployment 28
61 Use cases 28
62 Requirements 29
63 Architecture 30
64 Deployment 31
7 Second SC6 Pilot Deployment 32
D54 ndash v 100
Page
7
71 Use cases 32
72 Requirements 33
73 Architecture 34
74 Deployment 35
8 Second SC7 Pilot Deployment 36
81 Use cases 36
82 Requirements 37
83 Architecture 38
84 Deployment 39
9 Conclusions 41
List of Tables
Table 1 Requirements of the Second SC1 Pilot 11
Table 2 Components needed to Deploy Second SC1 Pilot 13
Table 3 Requirements of the Second SC2 Pilot 16
Table 4 Components needed to deploy the Second SC2 Pilot 19
Table 5 Requirements of the Second SC3 Pilot 21
Table 6 Components needed to deploy the Second SC3 Pilot 23
Table 7 Requirements of the Second SC4 Pilot 25
Table 8 Components needed to deploy the Second SC4 Pilot 28
Table 9 Requirements of the Second SC5 Pilot 29
Table 10 Components needed to deploy the Second SC5 Pilot 31
Table 11 Requirements of the Second SC6 Pilot 33
Table 12 Components needed to deploy the Second SC6 Pilot 36
Table 13 Requirements of the Second SC7 Pilot 38
Table 14 Components needed to deploy the Second SC7 Pilot 40
D54 ndash v 100
Page
8
List of Figures
Figure 1 Architecture of the Second SC1 Pilot 12
Figure 2 Architecture of the Second SC2 Pilot 17
Figure 3 Architecture of the Second SC3 Pilot 22
Figure 4 Architecture of the Second SC4 Pilot 26
Figure 5 Architecture of the Second SC5 Pilot 30
Figure 6 Architecture of the Second SC6 Pilot 34
Figure 7 Architecture of the Second SC7 Pilot 38
D54 ndash v 100
Page
9
1 Introduction
11 Purpose and Scope
This report documents the instantiations of the Big Data Integrator Platform (BDI) for serving
the needs of the domains examined within Big Data Europe These platform instances will be
provided to the relevant networking partners to execute the pilots foreseen in WP6
1.2 Methodology
Task 5.2 focuses on the application of the generic instantiation methodology in specific use cases pertaining to domains closely related to Europe's Societal Challenges. To this end, T5.2 comprises seven (7) distinct sub-tasks, each one dedicated to a different domain of application.
Participating partners and their role: NCSR-D (task leader) deploys the different instantiations of the Big Data Integrator Platform and supports the partners carrying out each pilot with consulting about the platform. This task includes two phases: the design phase and the deployment phase. The design phase involves the following:
- Review the pilot descriptions prepared in WP6 and request clarifications where needed, in order to prepare a detailed technical description of the platform that will support the pilot.
- Prepare a first draft of the sections for the second cycle pilots, where use cases and workflows from the pilot descriptions are summarized, and technical requirements and an architecture for each pilot-specific platform are drafted.
- Cooperate with the persons responsible for each pilot to update the pilot description and the technical description in this deliverable so that they are consistent and satisfactory. This draft also includes a list of components and their availability: (a) base platform components that are prepared in WP4; (b) pilot-specific components that are already available; or (c) pilot-specific components that will be developed for the pilot. Components are also assigned a partner responsible for their implementation.
- Review the pilot technical descriptions from the perspective of bridging between technical work and the community requirements, to establish that the pilot is relevant to the communities it is aimed at.
During the deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.
2 Second SC1 Pilot Deployment
2.1 Use Cases
The pilot is carried out by OPF and VU in the frame of SC1: Health, Demographic Change and Wellbeing.
The pilot demonstrates the workflow of reproducing the functionality of an existing data integration and processing system (the Open PHACTS Discovery Platform) on BDI. The second pilot extends the first pilot (cf. D5.2, Section 2) with the following:
- Discussions with stakeholders and other Societal Challenges will identify how the existing Open PHACTS platform and datasets may potentially be used to answer queries in other domains. In particular, applications in Societal Challenge 2 (food security and sustainable agriculture), where the effects of chemistry (e.g., pesticides) on biology are probed in plants, could exploit the linked data services currently within the OPF platform. This will require discussing use case specifics with SC2 to understand their requirements and ensure that the OPF data is applicable. Similarly, we will explore whether SC2 data that is relevant for early biology research could be linked to the OPF data platform.
- No specific new datasets are targeted for integration in the second pilot. However, if datasets to be made available through other pilots have clear potential links to Open PHACTS datasets, these will be considered for integration into the platform, to offer researchers the ability to pose more complex queries across a wider range of data.
- The second pilot will aim to expand on the first pilot by refreshing the datasets integrated into the pilot. Homogenising and integrating the new data available for these datasets, and developing ways to update datasets by integrating new data on an ongoing basis, will enable new use cases where researchers require fully current datasets for their queries.
- The second pilot will also simplify existing workflows for querying the API, for example with components for common software tools such as KNIME, reducing the barrier for academic institutions and companies to access the platform for knowledge- and data-driven biomedical research use cases.
2.2 Requirements
Table 1 lists the ingestion, storage, processing and output requirements set by this pilot.
Table 1 Requirements of the Second SC1 Pilot
R1. The solution should be packaged in a way such that it is possible to combine the Open PHACTS Docker and the BDE platform to achieve a custom integrated solution.
Comment: Specificities of the services of the Open PHACTS Discovery Platform should not be hard-wired into the domain-specific instance, but should be read from a configuration file (such as SWAGGER). The BDE instance should offer or apply these external services over data hosted by the BDE instance.

R2. RDF data storage.
Comment: The current Open PHACTS Discovery Platform is based on distributed Virtuoso, a proprietary solution. The BDE platform will provide a distributed 4store and SANSA, to be compared with the Open PHACTS Discovery Platform.

R3. Datasets are aligned and linked at data ingestion time, and the transformed data is stored.
Comment: In conjunction with R1, a modular data ingestion component should dynamically decide which data transformers to invoke.

R4. Data and query security and privacy requirements.
Comment: A BDI local deployment holds private data and serves private queries. BDE does not foresee any specific technical support for query obfuscation, so remote data sources need to be cloned locally to guarantee query privacy.
Figure 1 Architecture of the Second SC1 Pilot
2.3 Architecture
To satisfy the requirements above, the following modules will be deployed:
Storage infrastructures:
- Distributed triple store for the data. The second pilot cycle will also test the feasibility of using the SANSA stack [1] as an alternative for SPARQL query processing.
Processing infrastructures:
- Scientific Lenses query expansion.
Other modules:
- Data connector, including the data transformation modules for the alignment of data at ingestion time.
- REST API for querying that builds a SPARQL query by using keywords to fill in pre-defined query templates. The querying service also uses Scientific Lenses to expand queries (a minimal sketch follows below).
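As an illustration of the template-based querying described above, the following minimal Python sketch fills a keyword into a pre-defined SPARQL template and submits it to an endpoint using the SPARQLWrapper library. The template, prefixes, parameter names and endpoint URL are illustrative assumptions, not the actual Open PHACTS API:

# Minimal sketch of a keyword-driven, template-based SPARQL query service.
# The template and endpoint URL are illustrative assumptions.
from SPARQLWrapper import SPARQLWrapper, JSON

COMPOUND_TEMPLATE = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?compound ?label WHERE {
  ?compound skos:prefLabel ?label .
  FILTER(CONTAINS(LCASE(STR(?label)), LCASE("%(keyword)s")))
} LIMIT %(limit)d
"""

def keyword_query(endpoint_url, keyword, limit=10):
    """Fill the pre-defined template with a keyword and run the query."""
    sparql = SPARQLWrapper(endpoint_url)
    sparql.setQuery(COMPOUND_TEMPLATE % {"keyword": keyword, "limit": limit})
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()["results"]["bindings"]

# Example: keyword_query("http://localhost:8890/sparql", "aspirin")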
2.4 Deployment
Table 2 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.
[1] http://sansa-stack.net/
Table 2 Components needed to deploy the Second SC1 Pilot
4store: BDI dockers made available by WP4 (Responsible: NCSR-D)
SANSA stack: BDI dockers made available by WP4 (Responsible: FhG/UniBonn)
Data connector and transformation modules: Develop a dynamic transformation engine that uses SWAGGER descriptions to select the appropriate transformer (Responsible: VU)
Query endpoint: Develop a dynamic query re-write engine that uses SWAGGER descriptions to select the transformer (Responsible: VU)
Scientific Lenses query expansion module: Needs to be deployed and tested, unless an existing live service will be used for the BDE pilot (Responsible: VU)
3 Second SC2 Pilot Deployment
3.1 Overview
The pilot is carried out by AK, FAO and SWC in the frame of SC2: Food Security, Sustainable Agriculture and Forestry, Marine, Maritime and Inland Water Research, and the Bioeconomy.
The second pilot cycle builds upon the first pilot cycle (cf. D5.1, Section 3), expanding the relevant data sources and extending the data processing needed to handle a variety of data types (apart from bibliographic data) relevant to viticulture.
The pilot demonstrates the following workflows:
1. Text mining workflow: automatically annotating scientific publications by (a) extracting named entities (locations, domain terms) and (b) extracting the captions of images, figures and tables. The extracted information is provided to viticultural researchers via a GUI that exposes search functionality.
2. Data processing workflow: the end users (viticultural researchers) upload scientific data in a variety of formats and provide the metadata needed in order to correctly interpret the data. The data is ingested and homogenized so that it can be compared and connected with other relevant data originally in diverse formats. The data is exposed to viticultural researchers via a GUI that exposes search/discovery, aggregation, analysis, correlation and visualization functionalities over structured data. The results of the data analysis will be stored in the infrastructure to avoid carrying out the same processing multiple times, with appropriate provenance for future reference, publication and scientific replication.
3. Phenologic modeling workflow, that is, the scheduling of agricultural operations (e.g., pruning, harvesting) by cross-examining the weather data observed in the area of the vineyard with the appropriate weather conditions needed for the aforementioned operations (see the sketch after this list).
4. Variety identification workflow: the end users complete an on-spot questionnaire regarding the characteristics of a specific grape variety. Together with the geolocation of the questionnaire, this information is used to identify a grape variety.
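To illustrate the cross-examination step of the phenologic modeling workflow (item 3 above), the following minimal Python sketch matches a weather forecast against per-operation weather conditions; the thresholds and the weather record layout are illustrative assumptions, not the actual AK VITIS model:

# Minimal sketch of scheduling agricultural operations by cross-examining
# observed/forecast weather with suitable per-operation conditions.
# Thresholds and record layout are illustrative assumptions.
SUITABLE_CONDITIONS = {
    # operation: (min temperature C, max temperature C, max precipitation mm)
    "pruning":    (5.0, 20.0, 1.0),
    "harvesting": (10.0, 30.0, 0.0),
}

def schedule_operations(forecast):
    """Return, per day, the operations whose weather conditions are met.

    `forecast` is a list of dicts such as
    {"date": "2017-03-01", "temp": 12.0, "precip": 0.0}.
    """
    schedule = {}
    for day in forecast:
        schedule[day["date"]] = [
            op for op, (tmin, tmax, pmax) in SUITABLE_CONDITIONS.items()
            if tmin <= day["temp"] <= tmax and day["precip"] <= pmax]
    return schedule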
The following datasets will be involved:
- The AGRIS and PubMed datasets that include scientific publications.
- Weather data available via publicly-available APIs, such as AccuWeather, OpenWeatherMap and Weather Underground.
- User-generated data, such as geotagged photos of leaves, young shoots and grape clusters, ampelographic data, and SSR-marker data that will be provided by the VITIS application.
- The OIV Descriptor List [2] for Grape Varieties and Vitis species.
- The Crop Ontology.
The following processing is carried out:
- Named entity extraction.
- Researcher affiliation extraction and verification.
- Variety identification.
- Phenologic modelling.
- PDF structure processing to associate tables and diagrams with captions.
- Data ingestion and homogenization.
- Aggregation, analysis and correlation over scientific data.
The following outputs are made available for visualization or further processing:
- Structured information and topics extracted from scientific publications.
- Metadata for dataset searching and discovery.
- Aggregation, analysis and correlation results.
3.2 Requirements
Table 3 lists the ingestion, storage, processing and output requirements set by this pilot.
Table 3 Requirements of the Second SC2 Pilot
R1. In case of failures during data processing, users should be able to recover data produced earlier in the workflow, along with their lineage and associated metadata.
Comment: Processing modules should periodically store intermediate results and their lineage metadata. When starting up, processing modules should check at the metadata registry if intermediate results are available.

R2. Extracting images and their captions from scientific publications.
Comment: To be developed for the pilot, taking into account R1.

[2] http://www.oiv.int/en/
R3. Extracting thematic annotations from text in scientific publications.
Comment: To be developed for the pilot, taking into account R1.

R4. Extracting researcher affiliations from the scientific publications.
Comment: To be developed for the pilot, taking into account R1.

R5. Variety identification.
Comment: To be developed for the pilot, taking into account R1.

R6. Phenologic modeling.
Comment: To be developed for the pilot, taking into account R1.

R7. Expose data and metadata in JSON through a Web API.
Comment: The data ingestion module should write JSON documents in HDFS. 4store should be accessed via a SPARQL endpoint that responds with results in JSON.
Figure 2 Architecture of the Second SC2 Pilot
3.3 Architecture
To satisfy the requirements above, the following modules will be deployed:
Storage infrastructures:
- HDFS for storing publication full-text and ingested datasets.
- A graph database for storing publication metadata (terms and named entities), affiliation metadata (connections between researchers), weather metadata and VITIS metadata.
Processing infrastructures:
- Metadata extraction: Spark or UnifiedViews [3] are used to extract RDF metadata from publication full-text. These tools will react on Kafka messages. Spark and UnifiedViews will be evaluated for this task.
[3] Cf. http://www.unifiedviews.eu/
- PoolParty: a SKOS Thesaurus [4] will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g., bio terms, locations and other named entities). For this step the SWC PoolParty Semantic Suite [5] will be used. Additional enrichment of the dataset will be explored, e.g., via linking to DBpedia or other LOD sources.
- AKSTEM: the process of discovering relations and associations between organizations and people in the field of viticulture research.
- Phenologic modeling: already developed in AK VITIS; will be adapted to work in the context of an Apache Spark application.
- Variety identification: already developed in AK VITIS; will be adapted to work in the context of an Apache Spark application.
- Extraction of images and figures and their captions from publication PDFs.
- Data analysis, which writes analysis results back into the infrastructure to be retrieved for visualization. Data analysis should accompany each write-back with appropriate metadata that specifies the processing lineage of the derived dataset. Intermediate results should also be written out (and described as such in the metadata) in order to allow resuming processing after a failure.
Other modules:
- Flume for publication ingestion: for every source that will be ingested into the system there will be a Flume agent responsible for data ingestion and basic modification/unification.
- Kafka: as soon as a new record is available, a Kafka message will be produced. One Kafka consumer stores raw data into HDFS (a minimal sketch follows below).
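As a minimal sketch of the Kafka consumer that stores raw records into HDFS, the following Python fragment uses the kafka-python and hdfs (WebHDFS) client libraries; the topic name, broker address, NameNode URL and file naming are illustrative assumptions:

# Minimal sketch: consume raw records from a Kafka topic and store them
# in HDFS. Topic, broker, NameNode URL and paths are illustrative.
from kafka import KafkaConsumer
from hdfs import InsecureClient

consumer = KafkaConsumer("raw-publications",
                         bootstrap_servers="kafka:9092",
                         group_id="hdfs-writer")
hdfs = InsecureClient("http://namenode:50070", user="hdfs")

for message in consumer:
    # One file per record for simplicity; a production consumer would batch.
    path = "/data/raw/%s-%d.json" % (message.topic, message.offset)
    with hdfs.write(path) as writer:
        writer.write(message.value)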
3.4 Deployment
Table 4 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.
Table 4 Components needed to deploy the Second SC2 Pilot
Spark over HDFS, Flume, Kafka: BDI dockers made available by WP4 (Responsible: FhG, TF, InfAI, SWC)
[4] Cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System
[5] Cf. http://www.poolparty.biz/
GraphDB and/or Neo4j dockerization: To be investigated whether the Docker images provided by the official systems [6] are suitable for the pilot; if not, they will be altered for the pilot, or an already dockerized triple store such as Virtuoso or 4store will be used (Responsible: SWC)
Flume agents for publication ingestion and processing: To be developed for the pilot (Responsible: SWC)
Flume agents for data ingestion: To be extended for the pilot in order to support the introduced datasets (AccuWeather data, user-generated data) (Responsible: SWC, AK)
Data storage schema: To be developed for the pilot (Responsible: SWC, AK)
Phenologic modelling: To be adapted from AK VITIS for the pilot (Responsible: AK)
Spark AKSTEM: To be adapted from AK STEM for the pilot (Responsible: AK)
Variety identification: To be adapted from AK VITIS for the pilot (Responsible: AK)
[6] https://neo4j.com/developer/docker
4 Second SC3 Pilot Deployment
4.1 Overview
The pilot is carried out by CRES in the frame of SC3: Secure, Clean and Efficient Energy.
The second pilot cycle extends the first pilot by adding additional online and offline data analysis on raw data from Acoustic Emissions (AE) sensors and on aggregated data, such as parametrics from continuous monitoring systems (CMS). The pilot demonstrates the following workflow: a developer in the field of wind energy enhances condition monitoring for each unit in a wind farm by pooling together data from multiple units of the same farm (to consider the cluster operation in total) and third-party data (to perform correlated assessment). The custom analysis modules created by the developer use both raw data that is transferred offline to the processing cluster and condensed data streamed online in the order that the events occur.
The following datasets are involved:
- Raw sensor and SCADA data from a given wind farm.
- Online stream data comprised of parametrics and statistics extracted from the raw SCADA data.
- Raw sensor data from the Acoustic Emissions module of a given wind farm.
All data is in custom binary or ASCII formats. ASCII files contain a metadata header and, in tabulated form, the signal data (signals in columns, time sequence in rows). All data is annotated by location, time and system id.
The following processing is carried out:
- Near-real-time execution of parametrized models to return operational statistics and warnings, including correlation analysis of data across units.
- Weekly execution of operational statistics.
- Weekly execution of model parametrization.
- Weekly execution of specific acoustic emissions DSP.
The following outputs are made available for visualization or further processing:
- Operational statistics, near-real-time and weekly.
- Model parameters.
4.2 Requirements
Table 5 lists the ingestion, storage, processing and output requirements set by this pilot. Since the second cycle of the pilot extends the first pilot, some requirements are identical and are therefore omitted from Table 5.
Table 5 Requirements of the Second SC3 Pilot
R1. The online data will be sent (via OPC) from the intermediate (local) processing level to BDI.
Comment: A data connector must be developed that provides for receiving OPC streams from an OPC-compatible server.

R2. The application should be able to recover from short outages by collecting the data transmitted during the outage from the data sources.
Comment: An OPC data connector must be developed that can retrieve the missing data collected at the intermediate level from the distributed data historian systems.

R3. Near-real-time execution of parametrized models to return operational statistics, including correlation analysis of data across units.
Comment: The analysis software should write its results back in a specified format and data model that is appropriate input for further analysis.

R4. The GUI supports database querying and data visualization for the analytics results.
Comment: The GUI will be able to access files in the specified format and data model.
Figure 3 Architecture of the Second SC3 Pilot
4.3 Architecture
To satisfy the requirements above, the following modules will be deployed:
Storage infrastructures:
- HDFS, which stores binary blobs, each holding a temporal slice of the complete data. The slicing parameters are fixed and can be applied at data ingestion time.
- A Postgres relational database to store the warnings, operational statistics and the output of the analysis. The schema will be defined at a later stage.
- A Kafka broker that will distribute the continuous stream of CMS data for model execution.
Processing infrastructures:
- A processor that operates upon temporal slices of data.
- A Spark module that orchestrates the application of the processor on slices.
- A Spark streaming module that operates on the online data (a minimal sketch follows below).
Other modules:
- A data connector that offers an ingestion endpoint and/or can retrieve from remote data sources using the FTP protocol.
- A data connector that offers an ingestion endpoint that can retrieve an online stream using the OPC protocol and publish it to a Kafka topic.
- Data visualization that can visualize the data files stored in HDFS.
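The following is a minimal sketch of the Spark streaming module operating on the online CMS stream distributed by the Kafka broker, written against the PySpark streaming API with the Kafka direct connector; the topic, broker address, message layout and the per-unit statistic computed are illustrative assumptions:

# Minimal sketch: per-unit running statistics over the online CMS stream.
# Topic, broker and record layout are illustrative assumptions.
import json
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="cms-online-stats")
ssc = StreamingContext(sc, 60)  # one-minute micro-batches

stream = KafkaUtils.createDirectStream(
    ssc, ["cms-parametrics"], {"metadata.broker.list": "kafka:9092"})

# Each message value is assumed to be JSON such as {"unit": "...", "value": 1.0};
# compute a per-unit mean over each micro-batch as a near-real-time statistic.
means = (stream.map(lambda kv: json.loads(kv[1]))
               .map(lambda rec: (rec["unit"], (rec["value"], 1)))
               .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
               .mapValues(lambda s: s[0] / s[1]))
means.pprint()

ssc.start()
ssc.awaitTermination()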
4.4 Deployment
Table 6 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.
Table 6 Components needed to deploy the Second SC3 Pilot
Spark, HDFS, Postgres, Kafka: BDI dockers made available by WP4 (Responsible: FhG, TF, InfAI, NCSR-D, SWC)
Acoustic Emissions DSP: To be developed for the pilot (Responsible: CRES)
OPC data connector: To be developed for the pilot (Responsible: CRES)
Data visualization: To be extended for the pilot (Responsible: CRES)
5 Second SC4 Pilot Deployment
5.1 Use cases
The pilot is carried out by FhG and CERTH in the frame of SC4: Smart, Green and Integrated Transport.
The pilot demonstrates how to implement the workflow for ingesting, processing and storing stream and historical traffic data in a distributed environment. The pilot demonstrates the following workflows:
- The map matching of the Floating Car Data (FCD) stream that is generated by the taxi fleet. The FCD data, which represents the position of cabs using latitude and longitude coordinates, must be map-matched to the roads on which the cabs are driving in order to infer the traffic conditions of the roads. The map matching is done through an algorithm using a geographical database and topological rules.
- The monitoring of the current traffic conditions, which consumes the map-matched FCD data and infers the traffic conditions of the roads.
- The forecasting of future traffic conditions, based on a model that is trained from historical and real-time map-matched FCD data.
The second pilot is based upon the processing modules developed in the first pilot (cf. D5.2, Section 5), namely the processing modules developed by CERTH to analyze traffic data and classify traffic conditions. The second pilot will also develop the newly added workflow of traffic forecasting and model training that did not exist during the first pilot cycle.
The data sources available for the pilot are:
- A near-real-time stream of Floating Car Data (FCD) generated by a fleet of 1200 taxis, containing information about the position, speed and direction of the cabs.
- A historical database of recorded FCD data.
- A geographical database with information about the road network in Thessaloniki.
The results of traffic monitoring and traffic forecasting are saved into a database for querying, statistics and visualizations.
5.2 Requirements
Table 7 lists the ingestion, storage, processing and output requirements set by this pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 7 lists only the new requirements.
Table 7 Requirements of the Second SC4 Pilot
R1. The pilot will enable the evaluation of the present and future traffic conditions (e.g., congestion) within temporal windows.
Comment: The map-matched FCD data are used to determine the current traffic condition and to make predictions within different time windows.

R2. The traffic predictions will be saved in a database.
Comment: Traffic conditions and predictions will be used for queries, statistics, evaluation of the quality of predictions, and visualizations.

R3. The pilot can be started in two configurations: single node (for development and testing) and cluster (production).
Comment: It must be possible to run all the pilot components in one single node for development and testing purposes. The cluster configuration must provide clustering of any of the components: messaging system (Kafka), processing modules (Flink, Spark, TensorFlow), storage (Postgres).
Figure 4 Architecture of the Second SC4 Pilot
5.3 Architecture
The architecture of the pilot has been designed taking into consideration the data sources (mostly streams), the processing steps needed, and the information that needs to be computed. The pilot will ingest data from a near-real-time FCD data stream from cabs and from historical FCD data. The FCD data needs to be preprocessed for map matching before being used for classification/prediction.
Apache Kafka will be used to distribute the computations, as it provides a scalable, fault-tolerant messaging system. The processing of the data streams will be performed within temporal windows. Apache Flink will be used for the map matching algorithm, in the same manner as in the first cycle of the pilot. Apache Spark or TensorFlow will be considered as a platform to implement the traffic forecasting algorithm.
The algorithms used for the map matching and classification will be provided using R, as it provides good support for machine learning algorithms and because it is commonly used
D54 ndash v 100
Page
27
and well known by researchers at CERTH. In order to use the R packages in a Flink application developed in Java, the pilot will connect to an R server (via Rserve). Recurrent Neural Networks will be used for the traffic forecasting module.
The computed traffic conditions and predictions will be stored in a scalable, fault-tolerant database, such as Elasticsearch. The storage system must support spatial and temporal indexing.
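To make the ingestion side concrete, the following minimal Python sketch shows a Kafka producer of the kind foreseen in Table 8, polling a hypothetical FCD web service and publishing each record to a Kafka topic; the service URL, polling interval and record fields are illustrative assumptions:

# Minimal sketch: poll an FCD web service and publish records to Kafka.
# The URL, interval and record fields are illustrative assumptions.
import json
import time
import requests
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="kafka:9092",
                         value_serializer=lambda v: json.dumps(v).encode())

FCD_URL = "http://example.org/fcd/latest"  # hypothetical web service

while True:
    for record in requests.get(FCD_URL).json():
        # record: {"taxi_id": ..., "lat": ..., "lon": ..., "speed": ...,
        #          "direction": ..., "timestamp": ...}
        producer.send("fcd-raw", record)
    producer.flush()
    time.sleep(60)  # poll once per minute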
5.4 Deployment
Table 8 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.
Table 8 Components needed to deploy the Second SC4 Pilot
PostGIS, Elasticsearch, Kafka, Flink, Spark, TensorFlow: BDI dockers made available by WP4 (Responsible: NCSR-D, SWC, TF, FhG)
A Kafka producer for the FCD data stream (source: URL) and historical data (source: file system): Develop a Kafka producer to collect the FCD data as a stream from web services and from the file system for the historical data sets, and send them to a Kafka topic (Responsible: FhG)
Kafka brokers: Install Kafka to provide a message broker and the topics (Responsible: SWC)
A Spark application for traffic forecasting and model training: Develop a Spark application that consumes map-matched FCD data from a Kafka topic; the application will train a prediction model and write the traffic predictions to Elasticsearch (Responsible: FhG)
A Kafka consumer for storing analysis results: Develop a Kafka consumer that stores the results of the traffic classification and prediction module (Responsible: FhG)
6 Second SC5 Pilot Deployment
6.1 Use cases
The pilot is carried out by NCSR-D in the frame of SC5: Climate Action, Environment, Resource Efficiency and Raw Materials.
The pilot demonstrates the following workflow: a (potentially hazardous) substance is released into the atmosphere, resulting in increased readings at one or more monitoring stations. The user accesses a user interface provided by the pilot to define the locations of the monitoring stations, as well as a timeseries of the measured values (e.g., gamma dose rate). The platform initiates:
- a weather matching algorithm, that is, a search for similarity between the current weather and the pre-computed weather patterns, as well as
- a dispersion matching algorithm, that is, a search for similarity between the current substance dispersion patterns and the precomputed ones.
The weather patterns have been extracted in a pre-processing step by clustering weather conditions recorded in the past, while the substance dispersion patterns have been precomputed by simulating different scenarios of substance release and weather conditions. The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request.
The following datasets are involved:
- NetCDF files from the European Centre for Medium range Weather Forecasting (ECMWF [7]).
- GRIB files from the National Oceanic and Atmospheric Administration (NOAA [8]).
The following processing will be carried out:
- The weather clustering algorithm that creates clusters of similar weather conditions, implemented using the BDI platform (see Section 6.3).
[7] http://apps.ecmwf.int/datasets/
[8] https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/global-forcast-system-gfs
- The WRF downscaling that takes as input a low-resolution weather forecast and creates a high-resolution one.
- The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model that computes dispersion patterns given predominant weather conditions.
The following outputs are made available for visualization or further processing:
- The dispersions produced by DIPCOT.
- The weather clusters produced by the weather clustering algorithm.
6.2 Requirements
Table 9 lists the ingestion, storage, processing and output requirements set by this pilot.
Table 9 Requirements of the Second SC5 Pilot
R1. Provide a means of downloading current/evaluation weather from ECMWF or alternative services.
Comment: A data connector/interface needs to be developed.

R2. ECMWF and NOAA datasets are compatible with the WRF and DIPCOT naming conventions.
Comment: A preprocessing WPS normalization step will perform the transformations and variable renamings needed to ensure compatibility.

R3. Retrieve NetCDF files from HDFS as input to the weather clustering algorithm.

R4. Dispersion matching will filter on dispersion values.
Comment: A relational database will provide indexes on dispersion values for efficient dispersion search.

R5. Dispersion visualization.
Comment: Weather and dispersion matching must produce output compatible with Sextant's input, or Sextant must be modified to support the new input.
Figure 5 Architecture of the Second SC5 Pilot
6.3 Architecture
To satisfy the requirements described above, the following components will be deployed:
Storage infrastructure:
- HDFS for storing NetCDF and GRIB files.
- Postgres for storing dispersions.
Processing components:
- Scikit-learn or TensorFlow to host the weather clustering algorithm (a minimal sketch follows below).
Other modules:
- ECMWF and NOAA data connectors.
- WPS normalization procedure.
- WRF downscaling component.
- DIPCOT atmospheric dispersion model.
- Weather and dispersion matching.
- Sextant for visualizing the dispersion layer.
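As a minimal sketch of the weather clustering step, the following Python fragment flattens weather fields read from a NetCDF file into per-timestamp feature vectors and clusters them with scikit-learn; the file name, variable names and number of clusters are illustrative assumptions:

# Minimal sketch: cluster historical weather conditions into patterns.
# File and variable names and the cluster count are illustrative.
import numpy as np
from netCDF4 import Dataset
from sklearn.cluster import KMeans

nc = Dataset("ecmwf_2016.nc")  # hypothetical ECMWF extract
# Stack, e.g., 10m wind components and 2m temperature into a
# (n_timestamps, n_features) matrix, one row per timestamp.
fields = [np.asarray(nc.variables[v][:]) for v in ("u10", "v10", "t2m")]
features = np.hstack([f.reshape(f.shape[0], -1) for f in fields])

kmeans = KMeans(n_clusters=8).fit(features)
# kmeans.labels_ assigns each timestamp to a weather pattern; the cluster
# centers are the pre-computed patterns later matched against current weather.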
6.4 Deployment
Table 10 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.
Table 10 Components needed to deploy the Second SC5 Pilot
HDFS, Sextant, Postgres: BDI dockers made available by WP4 (Responsible: TF, UoA, NCSR-D)
Scikit-learn / TensorFlow: To be developed in the pilot (Responsible: NCSR-D)
DIPCOT: To be packaged in the pilot (Responsible: NCSR-D)
Weather clustering algorithm: To be developed in the pilot (Responsible: NCSR-D)
Weather matching: To be developed in the pilot (Responsible: NCSR-D)
Dispersion matching: To be developed in the pilot (Responsible: NCSR-D)
ECMWF and NOAA data connector: To be developed in the pilot (Responsible: NCSR-D)
Data visualization UI: To be developed in the pilot (Responsible: NCSR-D)
7 Second SC6 Pilot Deployment
7.1 Use cases
The pilot is carried out by NCSR-D and SWC in the frame of SC6: Europe in a changing world – Inclusive, innovative and reflective societies.
The pilot demonstrates the following workflow: municipality economic data (i.e., budget and budget execution data) are ingested on a regular basis (daily, weekly, and so on) from a series of locations, in a variety of structures and formats, and are homogenized so that they can be compared, analyzed and visualized in a comprehensible way. The data is exposed to users via a dashboard that exposes search/discovery, aggregation, analysis, correlation and visualization functionalities over structured data. The results of the data analysis will be stored in the infrastructure to avoid carrying out the same processing multiple times.
The second cycle of the pilot will extend the first pilot by incorporating different formats, by developing a modular parsing library.
The following datasets are involved:
- Budget execution data of the Municipality of Athens.
- Budget execution data of the Municipality of Thessaloniki.
- Budget execution data of the Municipality of Barcelona.
The current datasets involved are exposed either as an API or as CSV or XML files.
Datasets will be described by DCAT-AP [9] metadata and the FIBO [10] and FIGI [11] ontologies. Statistical data will be described in the RDF Data Cube [12] vocabulary.
The following processing is carried out:
- Data ingestion and homogenization.
- Aggregation, analysis and correlation over budget data.
The following outputs are made available for visualization or further processing:
- Structured information extracted from budget datasets, exposed as a SPARQL endpoint.
- Metadata for dataset searching and discovery.
- Aggregation and analysis results.
[9] Cf. https://joinup.ec.europa.eu/asset/dcat_application_profile/description
[10] Cf. http://www.omg.org/spec/EDMC-FIBO/FND/1.0/Beta1/index.htm
[11] Cf. http://www.omg.org/hot-topics/finance.htm
[12] Cf. https://www.w3.org/TR/2014/REC-vocab-data-cube-20140116/
7.2 Requirements
Table 11 lists the ingestion, storage, processing and output requirements set by this pilot.
Table 11 Requirements of the Second SC6 Pilot
R1. In case of failures during data processing, users should be able to recover data produced earlier in the workflow, along with their lineage and associated metadata.
Comment: Processing modules should periodically store intermediate results. When starting up, processing modules should check at the metadata registry if intermediate results are available.

R2. Transform budget data into a homogenized format using various parsers.
Comment: Parsers will be developed for the pilot, taking into account R1.

R3. Expose data and metadata through a SPARQL endpoint.
Comment: The triple store should be accessed via a SPARQL endpoint.

R4. Intuitive, easy-to-use interface for searching and selecting relevant data sources. The use of the user interface should be documented, so that users can ease into using it with as little effort as possible.
Comment: The GraphSearch UI will be used to create visualizations from SPARQL queries.
Figure 6 Architecture of the Second SC6 Pilot
7.3 Architecture
To satisfy the requirements above, the following modules will be deployed:
Storage infrastructures:
- HDFS for storing ingested datasets.
- 4store for storing homogenized statistical data and dataset metadata.
Processing infrastructures:
- Metadata extraction: Spark is used to extract RDF data and metadata from budget data. These tools will react on Kafka messages.
- PoolParty: a SKOS Thesaurus [13] will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g., locations and other named entities). For this step the SWC PoolParty Semantic Suite [14] will be used as an external service.
PoolParty is accessible from the BDE components via an HTTP API; the connection between Spark and PoolParty has been implemented in the first pilot cycle. Additional enrichment of the dataset will be explored, e.g., via linking to DBpedia or other LOD sources.
- Data analysis, which will be performed on demand by pre-defined queries in the dashboard.
Other modules:
- Flume for dataset ingestion: for every source that will be ingested into the system there will be a Flume agent responsible for data ingestion and basic modification/unification.
- Kafka: as soon as a new record is available, a Kafka message will be produced. One Kafka consumer stores raw data into HDFS.
- A set of pre-defined SPARQL queries that carry out analytical aggregations, important comparisons, and/or other analysis of the data (a minimal sketch follows below).
- GUI that provides functionality for (a) metadata searching to discover datasets, data and publications, and (b) linked data browsing (i.e., dereferencing entity descriptions in RDF), in the form of a visual dashboard realised in d3.js [15].
- GraphSearch as the user interface.
[13] Cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System
[14] Cf. http://www.poolparty.biz/
[15] Cf. https://d3js.org/
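As a minimal sketch of one such pre-defined analytical query, the following Python fragment aggregates homogenized budget observations per year via the 4store SPARQL endpoint; the endpoint URL and the Data Cube dimension and measure URIs are illustrative assumptions, not the actual pilot vocabulary:

# Minimal sketch: a pre-defined aggregation over RDF Data Cube budget data.
# Endpoint URL and the ex: properties are illustrative assumptions.
from SPARQLWrapper import SPARQLWrapper, JSON

TOTAL_PER_YEAR = """
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX ex: <http://example.org/budget#>
SELECT ?year (SUM(?amount) AS ?total) WHERE {
  ?obs a qb:Observation ;
       ex:refYear ?year ;
       ex:amount ?amount .
} GROUP BY ?year ORDER BY ?year
"""

sparql = SPARQLWrapper("http://4store:8080/sparql/")
sparql.setQuery(TOTAL_PER_YEAR)
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["year"]["value"], row["total"]["value"])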
7.4 Deployment
Table 12 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.
Table 12 Components needed to deploy the Second SC6 Pilot
Spark over HDFS, 4store, Flume, Kafka: BDI dockers made available by WP4 (Responsible: FhG, TF, InfAI, NCSR-D, SWC)
Data storage schema: To be extended for the pilot (Responsible: SWC)
Metadata extraction: Parsers for different data sources will be developed for the pilot (Responsible: SWC)
GraphSearch GUI: To be configured for the pilot (Responsible: SWC)
8 Second SC7 Pilot Deployment
8.1 Use cases
The pilot is carried out by SatCen, UoA and NCSR-D in the frame of SC7: Secure societies – Protecting freedom and security of Europe and its citizens.
The pilot demonstrates the following workflows:
1. Event detection workflow: news sites and social media are monitored and processed in order to extract and localize information about events. Events are categorized and the information in them is extracted; the end-user is notified about the area concerned by the news and can visualize the event information together with the changes detected by the other workflow (if activated).
2. Change detection workflow: the end user selects a relevant Area of Interest. With respect to the selected dates, two satellite images (earliest and latest) of these areas are downloaded from the ESA Sentinels Scientific Data Hub and processed in order to detect changes. The end-user is notified about detected changes and can view the images and event information about this area.
The second cycle of the SC7 pilot will extend the functionality and improve the performance of the first cycle of the pilot (cf. D5.2, Section 8).
Apart from the datasets used in the first cycle of the pilot, this cycle will also use the keyword-based Twitter API to retrieve tweets based on pre-defined keywords. To further support the keyword-based search, the second cycle of the pilot will also include a full-text indexing engine.
The following outputs are made available for visualization or further processing:
- Relevant news related to specific keywords, together with the corresponding Area of Interest.
- Detected changes.
Moreover, the event detection workflow will be extended in order to automatically activate the change detection workflow. These changes are depicted in the updated architecture diagram in Figure 7.
8.2 Requirements
Table 13 lists the ingestion, storage, processing and output requirements set by the second cycle of the pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 13 lists only the new requirements.
Table 13 Requirements of the Second SC7 Pilot
R1. Monitor keyword-based text services (Twitter). Text is retrieved and stored together with provenance and any metadata provided by the service (notably location).
Comment: The NOMAD data connectors to Twitter and Reuters will be adapted to access the keyword search API of Twitter and store to Cassandra.

R2. Regularly execute event detection using Spark over the most recent text batch.
Comment: Event detection is part of the ingestion process and adds annotations to the text data; it is not part of the distributed processing.

R3. Improve the speed of the change detection workflow.
Comment: Optimize the scalability of the operators developed in Apache Spark for the change detection workflow.

R4. Extend the change detection workflow to improve accuracy.
Comment: Fundamental SNAP operators (e.g., Subset and Terrain Correction) for Sentinel 1 will be adapted to Apache Spark.

R5. Areas of Interest are automatically defined by event detection.
Comment: The Sentinel data connector is parametrized from the event detection module with a GIS shape.

R6. End-user interface is based on Sextant.
Comment: Improvement of Sextant functionalities to improve the user experience.
R7. Users must be authenticated and authorized to access the pilot data.
Comment: Sextant will be extended in order to support authentication and authorization.
Figure 7 Architecture of the Second SC7 Pilot
8.3 Architecture
To satisfy the requirements above, the following modules will be deployed:
Storage infrastructures:
- HDFS for storing satellite images.
- Cassandra for storing news and tweet content and metadata.
- Lucene for storing the GADM dataset, i.e., the administrative areas together with their geo-locations.
- Strabon for storing geo-locations of detected changes and location metadata about news and tweets.
Processing infrastructures:
- Spark will be made available for improving the change detection module and developing the event detection module.
Data integration:
- Semagrow will federate Strabon and Cassandra to provide the user interface with homogeneous access to both data stores.
Other modules:
- Twitter data connector (a minimal sketch follows below).
- Reuters RSS feed reader.
- The Sentinel Data Aggregator, which receives as input the set of areas of interest and submits a suitable query to the Sentinels Scientific Data Hub.
- Sextant as the user interface.
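As a minimal sketch of the keyword-based Twitter data connector storing tweets and their metadata into Cassandra, the following Python fragment uses the classic (pre-4.0) tweepy client and the DataStax Cassandra driver; the credentials, keywords, keyspace and table schema are illustrative assumptions, not the actual NOMAD connector:

# Minimal sketch: retrieve tweets by pre-defined keywords and store them,
# with metadata, into Cassandra. Credentials, keyspace and schema are
# illustrative assumptions.
import tweepy
from cassandra.cluster import Cluster

auth = tweepy.OAuthHandler("consumer_key", "consumer_secret")
auth.set_access_token("access_token", "access_secret")
api = tweepy.API(auth)

session = Cluster(["cassandra"]).connect("sc7")
insert = session.prepare(
    "INSERT INTO tweets (id, text, created_at, geo) VALUES (?, ?, ?, ?)")

for keyword in ("flood", "wildfire"):  # pre-defined keywords
    for tweet in api.search(q=keyword, count=100):
        session.execute(insert, (tweet.id, tweet.text,
                                 tweet.created_at, str(tweet.geo)))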
8.4 Deployment
Table 14 lists the components provided to the pilot as part of BDI [16] and components that will be developed within WP6 in the context of executing the pilot.
Table 14 Components needed to deploy the Second SC7 Pilot
Module Task Responsible
Big Data Integrator
HDFSHadoop Cassandra
Spark Semagrow Strabon
SOLR
BDI dockers made available by WP4 FH TF InfAI
NCSR-D UoA
SwC
Cassandra and Strabon
stores
The schema needs to be altered to
support tweets by keyword
NCSR-D and
UoA
Change detection module Spark code to be developed for UoA
16 Cf httpsgithubcombig-data-europeREADMEwikiComponents
Event detection module: Spark code to be developed to scale the event detection algorithm (Responsible: NCSR-D)
Twitter data connector: To be extended to access the keyword search Twitter API (Responsible: NCSR-D)
User interface: To be enhanced for the pilot (Responsible: UoA)
9 Conclusions
This report analysed the pilot requirements and specified the components of the generic Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting round. The relevant work in this task is to ensure that the components are within the scope of what is prepared in WP4, and that they interoperate and can be used in the same application.
All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure and provided to the piloting partners as a basis for their piloting applications, which will be developed in WP6. As a result of this preliminary testing and the interaction between the technical partners and the piloting partners, some of the original pilot descriptions have been refined and fully specified, and their usage of BDI components has been clarified. This ensures that the pilot descriptions are consistent with the first public release of the BDI platform (D4.2) and can be reproduced by interested third parties.
Work in this task (Task 5.2) will proceed as follows:
- During the second pilot deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.
- During the third pilot deployment phase, work in this task will prepare the next version of this document, regarding the BDI instances needed for the third piloting round.
D54 ndash v 100
Page
3
Author List
Organisation Name Contact Information
NCSR-D Ioannis Mouchakis gmouchakisiitdemokritosgr
NCSR-D Stasinos Konstantopoulos konstantiitdemokritosgr
NCSR-D Angelos Charalambidis acharaliitdemokritosgr
NCSR-D Georgios Stavrinos gstavrinosiitdemokritosgr
NCSR-D Vangelis Karkaletsis vangelisiitdemokritosgr
Agroknow Pythagoras Karampiperis pythkagroknowcom
Agroknow Panagiotis Zervas pzervasagroknowcom
Open
PHACTS Bryn Williams-Jones brynopenphactsfoundationorg
Open
PHACTS Kiera McNeice kieraopenphactsfoundationorg
CRES Fragkiskos Mouzakis mouzakiscresgr
TenForce Aad Versteden aadverstedentenforcecom
UoB Hajira Jabeen hajirajabeengmailcom
SWC Juumlrgen Jakobitsch jjakobitschsemantic-webat
D54 ndash v 100
Page
4
Executive Summary
This report documents the instantiations of the Big Data Integrator Platform that underlies the
pilot applications that will be prepared in WP6 for serving exemplary use cases of the Horizon
2020 Societal Challenges These platform instances will be provided to the relevant networking
partners to be used for executing the pilot sessions foreseen in WP6
For each of the seven pilots this document provides (a) a brief summary of the pilot description
prepared in WP6 and especially of the use cases provided in the pilot descriptions (b) the
technical requirements for carrying out these use cases (c) an architecture that shows the BDI
components required to cover these requirements and (d) the list of components in the
architecture and their status (available as part of BDI or otherwise available or to be
developed as part of the pilot)
D54 ndash v 100
Page
5
Abbreviations and Acronyms
BDI
The Big Data Integrator platform that is developed within Big Data Europe
The components that are made available to the pilots by BDI are listed
here httpsgithubcombig-data-europeREADMEwikiComponents
BDI
Instance
A specific deployment of BDI complemented by tools specifically
supporting a given Big Data Europe pilot
BT Bluetooth
ECMWF European Centre for Medium range Weather Forecasting
ESGF Earth System Grid Federation
FCD Floating Car Data
LOD Linked Open Data
SC1 Societal Challenge 1 Health Demographic Change and Wellbeing
SC2 Societal Challenge 2 Food Security Sustainable Agriculture and Forestry
Marine Maritime and Inland Water Research and the Bioeconomy
SC3 Societal Challenge 3 Secure Clean and Efficient Energy
SC4 Societal Challenge 4 Smart Green and Integrated Transport
SC5 Societal Challenge 5 Climate Action Environment Resource Efficiency
and Raw Materials
SC6 Societal Challenge 6 Europe in a changing world ndash Inclusive innovative
and reflective societies
SC7 Societal Challenge 7 Secure societies ndash Protecting freedom and security
of Europe and its citizens
AK Agroknow Belgium
CERTH Centre for Research and Technology Greece
CRES Center for Renewable Energy Sources and Saving Greece
FAO Food and Agriculture Organization of the United Nations Italy
FhG Fraunhofer IAIS Germany
InfAI Institute for Applied Informatics Germany
NCSR-D National Center for Scientific Research ldquoDemokritosrdquo Greece
OPF Open PHACTS Foundation UK
SWC Semantic Web Company Austria
UoA National and Kapodistrian University of Athens
VU Vrije Universiteit Amsterdam the Netherlands
D54 ndash v 100
Page
6
Table of Contents 1 Introduction 9
11 Purpose and Scope 9
12 Methodology 9
2 Second SC1 Pilot Deployment 10
21 Use Cases 10
22 Requirements 10
23 Architecture 12
24 Deployment 12
3 Second SC2 Pilot Deployment 14
31 Overview 14
32 Requirements 15
33 Architecture 17
34 Deployment 18
4 Second SC3 Pilot Deployment 20
41 Overview 20
42 Requirements 21
43 Architecture 22
44 Deployment 23
5 Second SC4 Pilot Deployment 24
51 Use cases 24
52 Requirements 24
53 Architecture 26
54 Deployment 27
6 Second SC5 Pilot Deployment 28
61 Use cases 28
62 Requirements 29
63 Architecture 30
64 Deployment 31
7 Second SC6 Pilot Deployment 32
D54 ndash v 100
Page
7
71 Use cases 32
72 Requirements 33
73 Architecture 34
74 Deployment 35
8 Second SC7 Pilot Deployment 36
81 Use cases 36
82 Requirements 37
83 Architecture 38
84 Deployment 39
9 Conclusions 41
List of Tables
Table 1 Requirements of the Second SC1 Pilot 11
Table 2 Components needed to Deploy Second SC1 Pilot 13
Table 3 Requirements of the Second SC2 Pilot 16
Table 4 Components needed to deploy the Second SC2 Pilot 19
Table 5 Requirements of the Second SC3 Pilot 21
Table 6 Components needed to deploy the Second SC3 Pilot 23
Table 7 Requirements of the Second SC4 Pilot 25
Table 8 Components needed to deploy the Second SC4 Pilot 28
Table 9 Requirements of the Second SC5 Pilot 29
Table 10 Components needed to deploy the Second SC5 Pilot 31
Table 11 Requirements of the Second SC6 Pilot 33
Table 12 Components needed to deploy the Second SC6 Pilot 36
Table 13 Requirements of the Second SC7 Pilot 38
Table 14 Components needed to deploy the Second SC7 Pilot 40
D54 ndash v 100
Page
8
List of Figures
Figure 1 Architecture of the Second SC1 Pilot 12
Figure 2 Architecture of the Second SC2 Pilot 17
Figure 3 Architecture of the Second SC3 Pilot 22
Figure 4 Architecture of the Second SC4 Pilot 26
Figure 5 Architecture of the Second SC5 Pilot 30
Figure 6 Architecture of the Second SC6 Pilot 34
Figure 7 Architecture of the Second SC7 Pilot 38
D54 ndash v 100
Page
9
1 Introduction
11 Purpose and Scope
This report documents the instantiations of the Big Data Integrator Platform (BDI) for serving
the needs of the domains examined within Big Data Europe These platform instances will be
provided to the relevant networking partners to execute the pilots foreseen in WP6
12 Methodology
Task 52 focuses on the application of the generic Instantiation methodology in a specific Use
Case pertaining to domains closely related to Europersquos Social challenges To this end T52
comprises seven (7) distinct sub-tasks each one dedicated to a different domain of application
Participating partners and their role NCSR-D (task leader) deploys the different instantiations
of the Big Data Integrator Platform and supports the partners carrying out each pilot with
consulting about the platform This task includes two phases the design and the deployment
phase The design phase involves the following
Review the pilot descriptions prepared in WP6 and request clarifications where needed
in order to prepare a detailed technical description of the platform that will support the
pilot
Prepare a first draft of the sections for the second cycle pilots where use cases and
workflow from the pilot descriptions are summarized and technical requirements and
an architecture for each pilot-specific platform is drafted
Cooperate with the persons responsible for each pilot to update the pilot description
and the technical description in this deliverable so that they are consistent and
satisfactory This draft also includes a list of components and their availability (a) base
platform components that are prepared in WP4 (b) pilot-specific components that are
already available or (c) pilot-specific components that will be developed for the pilot
Components are also assigned a partner responsible for their implementation
Review the pilot technical descriptions from the perspective of bridging between
technical work and the community requirements to establish that the pilot is relevant
to the communities it is aimed at
During deployment phase work in this task will follow and document development of the
individual components and test their integration into the platform
D54 ndash v 100
Page
10
2 Second SC1 Pilot Deployment
21 Use Cases
The pilot is carried out by OPF and VU in the frame of SC1 Health Demographic Change and
Wellbeing
The pilot demonstrates the workflow of reproducing the functionality of an existing data
integration and processing system (the Open PHACTS Discovery Platform) on BDI The
second pilot extends the first pilot (cf D52 Section 2) with the following
Discussions with stakeholders and other Societal Challenges will identify how the
existing Open PHACTS platform and datasets may potentially be used to answer
queries in other domains In particular applications in Societal Challenge 2 (food
security and sustainable agriculture) where the effects of chemistry (eg pesticides)
on biology are probed in plants could exploit the linked data services currently within
the OPF platform This will require discussing use case specifics with SC2 to
understand their requirements and ensure that the OPF data is applicable Similarly
we will explore whether SC2 data could be linked to the OPF data platform is relevant
for early biology research
No specific new datasets are targeted for integration in the second pilot However if
datasets to be made available through other pilots have clear potential links to Open
PHACTS datasets these will be considered for integration into the platform to offer
researchers the ability to pose more complex queries across a wider range of data
The second pilot will aim to expand on first pilot by refreshing the datasets integrated
into the pilot Homogenising and integrating the new data available for these datasets
and developing ways to update datasets by integrating new data on an ongoing basis
will enable new use cases where researchers require fully current datasets for their
queries
The second pilot will also simplify existing workflows for querying the API for example
with components for common software tools such as KNIME reducing the barrier for
academic institutions and companies to access the platform for knowledge- and data-
driven biomedical research use cases
22 Requirements
Table 1 lists the ingestion storage processing and output requirements set by this pilot
Table 1 Requirements of the Second SC1 Pilot
D54 ndash v 100
Page
11
Requirement Comment
R1 The solution should be
packaged in a way such that it is
possible to combine the Open
PHACTS Docker and the BDE
platform to achieve a custom
integrated solution
Specificities of the services of the Open PHACTS
Discovery Platform should not be hard-wired into
the domain-specific instance but should be read
from a configuration file (such as SWAGGER)
The BDE instance should offer or apply these
external services over data hosted by the BDE
instance
R2 RDF data storage The current Open PHACTS Discovery Platform is
based on distributed Virtuoso a proprietary
solution The BDE platform will provide a
distributed 4store and SANSA to be compared
with the Open PHACTS Discovery Platform
R3 Datasets are aligned and linked
at data ingestion time and the
transformed data is stored
In conjunction with R1 a modular data ingestion
component should dynamically decide which data
transformers to invoke
R4 Data and query security and
privacy requirements
A BDI local deployment holds private data and
serves private queries BDE does not foresee any
specific technical support for query obfuscation
so remote data sources need to be cloned locally
to guarantee query privacy
Table 1 Requirements of the Second SC1 Pilot
D54 ndash v 100
Page
12
Figure 1 Architecture of the Second SC1 Pilot
Figure 1 Architecture of the Second SC1 pilot
23 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
Distributed triple store for the data The second pilot cycle will also test the feasibility of
using SANSA stack1 as an alternative of SPARQL query processing
Processing infrastructures
Scientific Lenses query expansion
Other modules
Data connector including the data transformation modules for the alignment of data at
ingestion time
REST API for querying that builds a SPARQL query by using keywords to fill in pre-
defined query templates The querying services also uses Scientific Lenses to expand
queries
24 Deployment
Table 2 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
1 httpsansa-stacknet
D54 ndash v 100
Page
13
Table 2 Components needed to Deploy Second SC1 Pilot
Module Task Responsible
4store BDI dockers made available by WP4 NCSR-D
SANSA stack BDI dockers made available by WP4 FhGUniBonn
Data connector and
transformation modules
Develop a dynamic transformation
engine that uses SWAGGER
descriptions to select the appropriate
transformer
VU
Query endpoint Develop a dynamic query re-write
engine that uses SWAGGER
descriptions to select the transformer
VU
Scientific Lenses query
expansion module
Needs to be deployed and tested
unless an existing live service will be
used for the BDE pilot
VU
Table 2 Components needed to Deploy Second SC1 Pilot
D54 ndash v 100
Page
14
3 Second SC2 Pilot Deployment
31 Overview
The pilot is carried out by AK FAO and SWC in the frame of SC2 Food Security Sustainable
Agriculture and Forestry Marine Maritime and Inland Water Research and the Bioeconomy
The second pilot cycle builds upon the first pilot cycle (cf D51 Section 3) expanding the
relevant data sources and extending the data processing needed to handle a variety of data
types (apart from bibliographic data) relevant to Viticulture
The pilot demonstrates the following workflows
1 Text mining workflow Automatically annotating scientific publications by (a) extracting
named entities (locations domain terms) and (b) extracting the captions of images
figures and tables The extracted information is provided to viticultural researchers via
a GUI that exposes search functionality
2 Data processing workflow The end users (viticultural researchers) upload scientific
data in a variety of formats and provide the metadata needed in order to correctly
interpret the data The data is ingested and homogenized so that it can be compared
and connected with other relevant data originally in diverse formats The data is
exposed to viticultural researchers via a GUI that exposes searchdiscovery
aggregation analysis correlation and visualization functionalities over structured data
The results of the data analysis will be stored in the infrastructure to avoid carrying out
the same processing multiple times with appropriate provence for future reference
publication and scientific replication
3 Phenologic modeling workflow that is the scheduling of agricultural operations (eg
pruning harvesting etc) by cross-examining the weather data observed in the area of
the vineyard with the appropriate weather conditions needed for the aforementioned
operations
4 Variety identification workflow The end users complete an on-spot questionnaire
regarding the characteristics of a specific grape variety Together with the geolocation
of the questionnaire this information is used to identify a grape variety
The following datasets will be involved:
- The AGRIS and PubMed datasets that include scientific publications.
- Weather data available via publicly available APIs, such as AccuWeather, OpenWeatherMap and Weather Underground.
- User-generated data, such as geotagged photos of leaves, young shoots and grape clusters, ampelographic data and SSR-marker data, that will be provided by the VITIS application.
- The OIV Descriptor List (http://www.oiv.int/en) for Grape Varieties and Vitis species.
- Crop Ontology.
The following processing is carried out:
- Named entity extraction.
- Researcher affiliation extraction and verification.
- Variety identification.
- Phenological modelling.
- PDF structure processing to associate tables and diagrams with captions.
- Data ingestion and homogenization.
- Aggregation, analysis and correlation over scientific data.
The following outputs are made available for visualization or further processing:
- Structured information and topics extracted from scientific publications.
- Metadata for dataset searching and discovery.
- Aggregation, analysis and correlation results.
3.2 Requirements
Table 3 lists the ingestion, storage, processing and output requirements set by this pilot.
Table 3 Requirements of the Second SC2 Pilot

Requirement | Comment
R1: In case of failures during data processing, users should be able to recover data produced earlier in the workflow, along with their lineage and associated metadata | Processing modules should periodically store intermediate results and their lineage metadata. When starting up, processing modules should check at the metadata registry if intermediate results are available
R2: Extracting images and their captions from scientific publications | To be developed for the pilot, taking into account R1
R3: Extracting thematic annotations from text in scientific publications | To be developed for the pilot, taking into account R1
R4: Extracting researcher affiliations from the scientific publications | To be developed for the pilot, taking into account R1
R5: Variety identification | To be developed for the pilot, taking into account R1
R6: Phenological modeling | To be developed for the pilot, taking into account R1
R7: Expose data and metadata in JSON through a Web API | Data ingestion module should write JSON documents in HDFS. 4store should be accessed via a SPARQL endpoint that responds with results in JSON
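Requirement R1 (recovery of data produced earlier in the workflow, together with lineage) could work along the following lines. This is a minimal sketch assuming a simple JSON metadata registry kept on HDFS via the hdfscli Python client; host names, paths and field names are illustrative, not the pilot's actual design.

```python
# Sketch of R1-style recovery: intermediate results are checkpointed together
# with lineage metadata, and modules consult the registry on start-up.
import json
from hdfs import InsecureClient  # hdfscli package

client = InsecureClient("http://namenode:50070", user="bde")  # assumed NameNode
REGISTRY = "/pilot-sc2/registry.json"

def load_registry():
    """Read the metadata registry, or start empty if it does not exist yet."""
    try:
        with client.read(REGISTRY) as reader:
            return json.loads(reader.read())
    except Exception:
        return {}

def checkpoint(step, output_path, lineage):
    """Record that `step` produced `output_path` from the given input paths."""
    registry = load_registry()
    registry[step] = {"output": output_path, "lineage": lineage}
    client.write(REGISTRY, json.dumps(registry), overwrite=True)

def resume_from(step):
    """On start-up, return the previous result for `step`, if any."""
    return load_registry().get(step)
```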
Figure 2 Architecture of the Second SC2 Pilot
3.3 Architecture
To satisfy the requirements above, the following modules will be deployed:
Storage infrastructures:
- HDFS for storing publication full-text and ingested datasets.
- A graph database for storing publication metadata (terms and named entities), affiliation metadata (connections between researchers), weather metadata and VITIS metadata.
Processing infrastructures:
- Metadata extraction: Spark or UnifiedViews (http://www.unifiedviews.eu) are used to extract RDF metadata from publication full-text. These tools will react on Kafka messages. Spark and UnifiedViews will be evaluated for this task.
- PoolParty: a SKOS Thesaurus (https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System) will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g. bio terms, locations and other named entities). For this step the SWC PoolParty Semantic Suite (http://www.poolparty.biz) will be used. Additional enrichment of the dataset will be explored, e.g. via linking to DBpedia or other LOD sources.
- AKSTEM: the process of discovering relations and associations between organizations and people in the field of viticulture research.
- Phenological Modeling: an algorithm already developed in AK VITIS will be adapted to work in the context of an Apache Spark application.
- Variety Identification: already developed in AK VITIS, to be adapted to work in the context of an Apache Spark application.
- Extraction of images and figures, and their captions, from publication PDFs.
- Data analysis, which writes analysis results back into the infrastructure to be retrieved for visualization. Data analysis should accompany each write-back with appropriate metadata that specify the processing lineage of the derived dataset. Intermediate results should also be written out (and described as such in the metadata) in order to allow resuming processing after a failure.
Other modules:
- Flume for publication ingestion: for every source that will be ingested into the system there will be a Flume agent responsible for data ingestion and basic modification/unification.
- Kafka: as soon as a new record is available, a Kafka message will be produced. One Kafka consumer stores raw data into HDFS.
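The Kafka consumer that archives raw records into HDFS could look like the following sketch, assuming the kafka-python and hdfscli libraries; the topic name, hosts and target paths are illustrative.

```python
# Sketch of the raw-data archiving consumer; names and paths are assumed.
from kafka import KafkaConsumer
from hdfs import InsecureClient

consumer = KafkaConsumer(
    "sc2-ingestion",                      # assumed topic fed by the Flume agents
    bootstrap_servers="kafka:9092",
    group_id="raw-data-archiver",
    auto_offset_reset="earliest",
)
hdfs = InsecureClient("http://namenode:50070", user="bde")

for message in consumer:
    # One file per record, keyed by topic/partition/offset for traceability.
    path = "/raw/%s/%d/%d.json" % (message.topic, message.partition, message.offset)
    hdfs.write(path, message.value, overwrite=True)
```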
3.4 Deployment
Table 4 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.
Table 4 Components needed to deploy the Second SC2 Pilot

Module | Task | Responsible
Spark over HDFS, Flume, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, SWC
GraphDB and/or Neo4j dockerization | To be investigated if the Docker images provided by the official systems (https://neo4j.com/developer/docker) are suitable for the pilot. If not, they will be altered for the pilot, or an already dockerized triple store such as Virtuoso or 4store will be used | SWC
Flume agents for publication ingestion and processing | To be developed for the pilot | SWC
Flume agents for data ingestion | To be extended for the pilot in order to support the introduced datasets (AccuWeather data, user-generated data) | SWC, AK
Data storage schema | To be developed for the pilot | SWC, AK
Phenological modelling | To be adapted from AK VITIS for the pilot | AK
Spark AKSTEM | To be adapted from AK STEM for the pilot | AK
Variety Identification | To be adapted from AK VITIS for the pilot | AK
4 Second SC3 Pilot Deployment
4.1 Overview
The pilot is carried out by CRES in the frame of SC3: Secure, Clean and Efficient Energy.
The second pilot cycle extends the first pilot by adding additional online and offline data analysis on raw data from Acoustic Emissions (AE) sensors and on aggregated data, such as parametrics from continuous monitoring systems (CMS). The pilot demonstrates the following workflow: a developer in the field of wind energy enhances condition monitoring for each unit in a wind farm by pooling together data from multiple units from the same farm (to consider the cluster operation in total) and third-party data (to perform correlated assessment). The custom analysis modules created by the developer use both raw data that is transferred offline to the processing cluster and condensed data streamed online in the same time order as the events occur.
The following datasets are involved:
- Raw sensor and SCADA data from a given wind farm.
- Online stream data comprised of parametrics and statistics extracted from the raw SCADA data.
- Raw sensor data from the Acoustic Emissions module of a given wind farm.
All data is in custom binary or ASCII formats. ASCII files contain a metadata header and, in tabulated form, the signal data (signals in columns, time sequence in rows). All data is annotated by location, time and system id.
The following processing is carried out:
- Near-real-time execution of parametrized models to return operational statistics and warnings, including correlation analysis of data across units.
- Weekly execution of operational statistics.
- Weekly execution of model parametrization.
- Weekly execution of specific acoustic emissions DSP.
The following outputs are made available for visualization or further processing:
- Operational statistics, near-real-time and weekly.
- Model parameters.
4.2 Requirements
Table 5 lists the ingestion, storage, processing and output requirements set by this pilot. Since the second cycle of the pilot extends the first pilot, some requirements are identical and therefore omitted from Table 5.
Table 5 Requirements of the Second SC3 Pilot

Requirement | Comment
R1: The online data will be sent (via OPC) from the intermediate (local) processing level to BDI | A data connector must be developed that provides for receiving OPC streams from an OPC-compatible server
R2: The application should be able to recover from short outages by collecting the data transmitted during the outage from the data sources | An OPC data connector must be developed that can retrieve the missing data collected at the intermediate level from the distributed data historian systems
R3: Near-real-time execution of parametrized models to return operational statistics, including correlation analysis of data across units | The analysis software should write its results back in a specified format and data model that is appropriate input for further analysis
R4: The GUI supports database querying and data visualization for the analytics results | The GUI will be able to access files in the specified format and data model
Figure 3 Architecture of the Second SC3 Pilot
4.3 Architecture
To satisfy the requirements above, the following modules will be deployed:
Storage infrastructures:
- HDFS, which stores binary blobs, each holding a temporal slice of the complete data. The slicing parameters are fixed and can be applied at data ingestion time.
- A Postgres relational database to store the warnings, operational statistics and the output of the analysis. The schema will be defined at a later stage.
- A Kafka broker that will distribute the continuous stream of CMS data to model execution.
Processing infrastructures:
- A processor that operates upon temporal slices of data.
- A Spark module that orchestrates the application of the processor on slices.
- A Spark streaming module that operates on the online data.
Other modules:
- A data connector that offers an ingestion endpoint and/or can retrieve from remote data sources using the FTP protocol.
- A data connector that offers an ingestion endpoint that can retrieve an online stream using the OPC protocol and publish it to a Kafka topic.
- Data visualization that can visualize the data files stored in HDFS.
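The batch side of this design, where Spark distributes the slice processor over the temporal slices stored in HDFS, could be sketched as follows; `process_slice` stands in for the CRES analysis code, and paths and counts are illustrative.

```python
# Sketch of orchestrating the per-slice processor with Spark.
from pyspark import SparkContext

sc = SparkContext(appName="sc3-weekly-statistics")

def process_slice(path):
    """Run the per-slice processor (DSP, statistics) on one temporal slice."""
    # ... read the binary blob at `path`, compute statistics and warnings ...
    return {"slice": path, "warnings": [], "statistics": {}}

slice_paths = ["hdfs:///sc3/slices/week42/%04d.bin" % i for i in range(256)]
results = sc.parallelize(slice_paths).map(process_slice).collect()
# `results` would then be written to the Postgres statistics tables.
```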
4.4 Deployment
Table 6 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.
Table 6 Components needed to deploy the Second SC3 Pilot

Module | Task | Responsible
Spark, HDFS, Postgres, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, SWC
Acoustic Emissions DSP | To be developed for the pilot | CRES
OPC Data connector | To be developed for the pilot | CRES
Data visualization | To be extended for the pilot | CRES
5 Second SC4 Pilot Deployment
5.1 Use cases
The pilot is carried out by FhG and CERTH in the frame of SC4: Smart, Green and Integrated Transport.
The pilot demonstrates how to implement the workflow for ingesting, processing and storing stream and historical traffic data in a distributed environment. The pilot demonstrates the following workflows:
- The map matching of the Floating Car Data (FCD) stream that is generated by the taxi fleet. The FCD data, which represents the position of cabs using latitude and longitude coordinates, must be map matched to the roads on which the cabs are driving in order to infer the traffic conditions of the roads. The map matching is done through an algorithm using a geographical database and topological rules.
- The monitoring of the current traffic conditions, which consumes the mapped FCD data and infers the traffic conditions of the roads.
- The forecasting of future traffic conditions, based on a model that is trained from historical and real-time mapped FCD data.
The second pilot is based upon the processing modules developed in the first pilot (cf. D5.2, Section 5), namely the processing modules developed by CERTH to analyze traffic data and classify traffic conditions. The second pilot will also develop the newly added workflow of traffic forecasting and model training that did not exist during the first pilot cycle.
The data sources available for the pilot are:
- A near-real-time stream of Floating Car Data (FCD) generated by a fleet of 1200 taxis, containing information about the position, speed and direction of the cabs.
- A historical database of recorded FCD data.
- A geographical database with information about the road network in Thessaloniki.
The results of traffic monitoring and traffic forecasting are saved into a database for querying, statistics and visualizations.
5.2 Requirements
Table 7 lists the ingestion, storage, processing and output requirements set by this pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 7 lists only the new requirements.
Table 7 Requirements of the Second SC4 Pilot

Requirement | Comment
R1: The pilot will enable the evaluation of the present and future traffic conditions (e.g. congestion) within temporal windows | The FCD map-matched data are used to determine the current traffic condition and to make predictions within different time windows
R2: The traffic predictions will be saved in a database | Traffic conditions and predictions will be used for queries, statistics, evaluation of the quality of predictions, and visualizations
R3: The pilot can be started in two configurations: single node (for development and testing) and cluster (production) | It must be possible to run all the pilot components in one single node for development and testing purposes. The cluster configuration must provide a cluster for any of the components: messaging system (Kafka), processing modules (Flink, Spark, TensorFlow), storage (Postgres)
Figure 4 Architecture of the Second SC4 Pilot
5.3 Architecture
The architecture of the pilot has been designed taking into consideration the data sources (mostly streams), the processing steps needed, and the information that needs to be computed. The pilot will ingest data from a near-real-time FCD data stream from cabs and from historical FCD data. The FCD data needs to be preprocessed for map matching before being used for classification/prediction.
Apache Kafka will be used to distribute the computations, as it provides a scalable, fault-tolerant messaging system. The processing of the data streams will be performed within temporal windows. Apache Flink will be used for the map matching algorithm, in the same manner as in the first cycle of the pilot. Apache Spark or TensorFlow will be considered as a platform to implement the traffic forecasting algorithm.
The algorithms used for the map matching and classification will be provided using R, as it provides good support for machine learning algorithms and because it is commonly used and well known by researchers at CERTH. In order to use the R packages in a Flink application developed in Java, the pilot will connect to an R server (via Rserve). Recurrent Neural Networks will be used for the traffic forecasting module.
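The Rserve pattern, calling R model code from the distributed application, can be illustrated with the following sketch. The pilot does this from Java Flink; the sketch uses the pyRserve Python client for brevity, and the model file and variable names are assumptions.

```python
# Sketch of calling R over Rserve, as the Flink (Java) job would do.
import pyRserve

conn = pyRserve.connect(host="rserve", port=6311)  # assumed Rserve host/port
conn.r.speeds = [42.0, 38.5, 12.3]        # map-matched FCD speeds for one road
conn.eval('load("traffic_model.RData")')  # assumed pre-trained R model object
level = conn.eval("predict(traffic_model, data.frame(speed = speeds))")
conn.close()
```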
The computed traffic conditions and predictions will be stored in a scalable, fault-tolerant database such as Elasticsearch. The storage system must support spatial and temporal indexing.
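Assuming Elasticsearch is chosen, an index with geo_point and date mappings would give the spatial and temporal indexing required above. This is a sketch against a recent elasticsearch-py client; index and field names are illustrative.

```python
# Sketch of a spatially and temporally indexed store for traffic predictions.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://elasticsearch:9200"])
es.indices.create(index="traffic-predictions", body={
    "mappings": {"properties": {
        "road_id":         {"type": "keyword"},
        "location":        {"type": "geo_point"},  # spatial index
        "timestamp":       {"type": "date"},       # temporal index
        "predicted_level": {"type": "integer"},
    }}
})
es.index(index="traffic-predictions", body={
    "road_id": "thess-1234",
    "location": {"lat": 40.64, "lon": 22.94},
    "timestamp": "2017-02-24T10:15:00Z",
    "predicted_level": 3,
})
```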
5.4 Deployment
Table 8 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.
Table 8 Components needed to deploy the Second SC4 Pilot

Module | Task | Responsible
PostGIS, Elasticsearch, Kafka, Flink, Spark, TensorFlow | BDI dockers made available by WP4 | NCSR-D, SWC, TF, FhG
A Kafka producer for the FCD data stream (source: URL) and historical data (source: file system) | Develop a Kafka producer to collect the FCD data as a stream from web services and from the file system for the historical data sets, and send them to a Kafka topic | FhG
Kafka brokers | Install Kafka to provide a message broker and the topics | SWC
A Spark application for traffic forecasting and model training | Develop a Spark application that consumes FCD map-matched data from a Kafka topic. The application will train a prediction model and write the traffic predictions to Elasticsearch | FhG
A Kafka consumer for storing analysis results | Develop a Kafka consumer that stores the results of the traffic classification and prediction modules | FhG
6 Second SC5 Pilot Deployment
6.1 Use cases
The pilot is carried out by NCSR-D in the frame of SC5: Climate Action, Environment, Resource Efficiency and Raw Materials.
The pilot demonstrates the following workflow. A (potentially hazardous) substance is released into the atmosphere, resulting in increased readings in one or more monitoring stations. The user accesses a user interface provided by the pilot to define the locations of the monitoring stations as well as a timeseries of the measured values (e.g. gamma dose rate). The platform initiates:
- a weather matching algorithm, that is, a search for similarity of the current weather with the pre-computed weather patterns, as well as
- a dispersion matching algorithm, that is, a search for similarity of the current substance dispersion patterns with the precomputed ones.
The weather patterns have been extracted in a pre-processing step by clustering weather conditions recorded in the past, while the substance dispersion patterns have been precomputed by simulating different scenarios of substance release and weather conditions. The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request.
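The matching step can be sketched as follows, assuming the weather patterns were produced by a scikit-learn KMeans run over historical weather feature vectors (e.g. wind components per grid point, flattened); the feature file and cluster count are assumptions.

```python
# Sketch of weather clustering and matching against precomputed patterns.
import numpy as np
from sklearn.cluster import KMeans

historical = np.load("weather_features.npy")    # assumed precomputed features
kmeans = KMeans(n_clusters=50).fit(historical)  # the pre-computed patterns

def match_weather(current_features):
    """Return the id of the precomputed weather pattern closest to now."""
    return int(kmeans.predict(current_features.reshape(1, -1))[0])

# The matched pattern id is then used to retrieve the dispersions that were
# precomputed for that weather pattern and the release scenarios.
```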
The following datasets are involved:
- NetCDF files from the European Centre for Medium-range Weather Forecasting (ECMWF, http://apps.ecmwf.int/datasets).
- GRIB files from the National Oceanic and Atmospheric Administration (NOAA, https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/global-forcast-system-gfs).
The following processing will be carried out:
- The weather clustering algorithm that creates clusters of similar weather conditions, implemented using the BDI platform (see Section 6.3).
- The WRF downscaling that takes as input low-resolution weather data and creates high-resolution weather data.
- The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model, which computes dispersion patterns given predominant weather conditions.
The following outputs are made available for visualization or further processing:
- The dispersions produced by DIPCOT.
- The weather clusters produced by the weather clustering algorithm.
6.2 Requirements
Table 9 lists the ingestion, storage, processing and output requirements set by this pilot.
Table 9 Requirements of the Second SC5 Pilot

Requirement | Comment
R1: Provide a means of downloading current/evaluation weather from ECMWF or alternative services | A data connector/interface needs to be developed
R2: ECMWF and NOAA datasets are compatible with the WRF and DIPCOT naming conventions | A preprocessing WPS normalization step will perform the necessary transformations and variable renamings needed to ensure compatibility
R3: Retrieve NetCDF files from HDFS as input to the weather clustering algorithm |
R4: Dispersion matching will filter on dispersion values | A relational database will provide indexes on dispersion values for efficient dispersion search
R5: Dispersion visualization | Weather and dispersion matching must produce output compatible with Sextant's input, or Sextant must be modified to support the new input
Figure 5 Architecture of the Second SC5 Pilot
6.3 Architecture
To satisfy the requirements described above, the following components will be deployed:
Storage infrastructure:
- HDFS for storing NetCDF and GRIB files.
- Postgres for storing dispersions.
Processing components:
- Scikit-learn or TensorFlow to host the weather clustering algorithm.
Other modules:
- ECMWF and NOAA data connectors.
- WPS normalization procedure.
- WRF downscaling component.
- DIPCOT atmospheric dispersion model.
- Weather and dispersion matching.
- Sextant for visualizing the dispersion layer.
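The Postgres side of dispersion matching (requirement R4) could be sketched as follows; the table layout, index and tolerance are illustrative assumptions, not the pilot's actual schema.

```python
# Sketch of an indexed dispersion store supporting the similarity filter.
import psycopg2

conn = psycopg2.connect(host="postgres", dbname="sc5", user="bde", password="bde")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS dispersion (
        scenario_id INTEGER,
        station_id  TEXT,
        value       DOUBLE PRECISION
    );
    CREATE INDEX IF NOT EXISTS dispersion_value_idx
        ON dispersion (station_id, value);
""")
conn.commit()

# Dispersion matching: scenarios whose precomputed value at a station falls
# within a tolerance band around the observed reading.
observed = 4.2e-6
cur.execute(
    "SELECT scenario_id FROM dispersion "
    "WHERE station_id = %s AND value BETWEEN %s AND %s",
    ("station-7", 0.9 * observed, 1.1 * observed),
)
candidates = [row[0] for row in cur.fetchall()]
```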
6.4 Deployment
Table 10 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.
Table 10 Components needed to deploy the Second SC5 Pilot

Module | Task | Responsible
HDFS, Sextant, Postgres | BDI dockers made available by WP4 | TF, UoA, NCSR-D
Scikit-learn / TensorFlow | To be developed in the pilot | NCSR-D
DIPCOT | To be packaged in the pilot | NCSR-D
Weather clustering algorithm | To be developed in the pilot | NCSR-D
Weather matching | To be developed in the pilot | NCSR-D
Dispersion matching | To be developed in the pilot | NCSR-D
ECMWF and NOAA data connector | To be developed in the pilot | NCSR-D
Data visualization UI | To be developed in the pilot | NCSR-D
7 Second SC6 Pilot Deployment
7.1 Use cases
The pilot is carried out by NCSR-D and SWC in the frame of SC6: Europe in a changing world – inclusive, innovative and reflective societies.
The pilot demonstrates the following workflow. Municipality economic data (i.e. budget and budget execution data) are ingested on a regular basis (daily, weekly and so on) from a series of locations, in a variety of structures and formats, and are homogenized so that they can be compared, analyzed and visualized in a comprehensible way. The data is exposed to users via a dashboard that exposes search/discovery, aggregation, analysis, correlation and visualization functionalities over structured data. The results of the data analysis will be stored in the infrastructure to avoid carrying out the same processing multiple times.
The second cycle of the pilot will extend the first pilot by incorporating different formats, through the development of a modular parsing library.
The following datasets are involved:
- Budget execution data of the Municipality of Athens.
- Budget execution data of the Municipality of Thessaloniki.
- Budget execution data of the Municipality of Barcelona.
The current datasets involved are exposed either as an API or as CSV / XML files.
Datasets will be described by DCAT-AP (https://joinup.ec.europa.eu/asset/dcat_application_profile/description) metadata and the FIBO (http://www.omg.org/spec/EDMC-FIBO/FND/1.0/Beta1/index.htm) and FIGI (http://www.omg.org/hot-topics/finance.htm) ontologies. Statistical data will be described in the RDF Data Cube (https://www.w3.org/TR/2014/REC-vocab-data-cube-20140116/) vocabulary.
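As an illustration of the RDF Data Cube representation, the sketch below turns one budget line into a qb:Observation using rdflib. The dataset URI and the dimension/measure properties are illustrative stand-ins, not the pilot's actual vocabulary.

```python
# Sketch of one budget line as an RDF Data Cube observation.
from rdflib import Graph, Literal, Namespace, RDF, XSD

QB = Namespace("http://purl.org/linked-data/cube#")
EX = Namespace("http://budgets.example.org/")  # assumed namespace

g = Graph()
obs = EX["obs/athens-2016-expense-001"]
g.add((obs, RDF.type, QB.Observation))
g.add((obs, QB.dataSet, EX["dataset/athens-budget-execution"]))
g.add((obs, EX.municipality, EX["municipality/athens"]))
g.add((obs, EX.fiscalYear, Literal("2016", datatype=XSD.gYear)))
g.add((obs, EX.amount, Literal("12500.00", datatype=XSD.decimal)))
print(g.serialize(format="turtle"))
```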
The following processing is carried out:
- Data ingestion and homogenization.
- Aggregation, analysis and correlation over the budget data.
The following outputs are made available for visualization or further processing:
- Structured information extracted from budget datasets, exposed as a SPARQL endpoint.
- Metadata for dataset searching and discovery.
- Aggregation and analysis results.
7.2 Requirements
Table 11 lists the ingestion, storage, processing and output requirements set by this pilot.
Table 11 Requirements of the Second SC6 Pilot

Requirement | Comment
R1: In case of failures during data processing, users should be able to recover data produced earlier in the workflow, along with their lineage and associated metadata | Processing modules should periodically store intermediate results. When starting up, processing modules should check at the metadata registry if intermediate results are available
R2: Transform budget data into a homogenized format using various parsers | Parsers will be developed for the pilot, taking into account R1
R3: Expose data and metadata through a SPARQL endpoint | The triple store should be accessed via a SPARQL endpoint
R4: Intuitive, easy-to-use interface for searching and selecting relevant data sources. The use of the user interface should be documented so that users can ease into using it with as little effort as possible | The GraphSearch UI will be used to create visualizations from SPARQL queries
Figure 6 Architecture of the Second SC6 Pilot
7.3 Architecture
To satisfy the requirements above, the following modules will be deployed:
Storage infrastructures:
- HDFS for storing ingested datasets.
- 4store for storing homogenized statistical data and dataset metadata.
Processing infrastructures:
- Metadata extraction: Spark is used to extract RDF data and metadata from budget data. These tools will react on Kafka messages.
- PoolParty: a SKOS Thesaurus (https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System) will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g. bio terms, locations and other named entities). For this step the SWC PoolParty Semantic Suite (http://www.poolparty.biz) will be used as an external service. PoolParty is accessible from the BDE components via an HTTP API; the connection between Spark and PoolParty has been implemented in the first pilot cycle. Additional enrichment of the dataset will be explored, e.g. via linking to DBpedia or other LOD sources.
- Data analysis that will be performed on demand by pre-defined queries in the dashboard.
Other modules:
- Flume for dataset ingestion: for every source that will be ingested into the system there will be a Flume agent responsible for data ingestion and basic modification/unification.
- Kafka: as soon as a new record is available, a Kafka message will be produced. One Kafka consumer stores raw data into HDFS.
- A set of pre-defined SPARQL queries that carry out analytical aggregations, important comparisons and/or other analysis of the data (see the sketch after this list).
- GUI that provides functionality for (a) metadata searching, to discover datasets, data and publications, and (b) linked data browsing (i.e. dereferencing entity descriptions in RDF), in the form of a visual dashboard realised in d3.js (https://d3js.org).
- GraphSearch as the user interface.
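One of the pre-defined analytical SPARQL queries could look like the following sketch: total executed amount per municipality and year over the Data Cube observations. The property names follow the illustrative vocabulary used in the observation sketch above, so they are assumptions rather than the pilot's actual schema.

```python
# Sketch of a pre-defined analytical aggregation query (illustrative names).
AGGREGATE_BUDGETS = """
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX ex: <http://budgets.example.org/>
SELECT ?municipality ?year (SUM(?amount) AS ?total)
WHERE {
  ?obs a qb:Observation ;
       ex:municipality ?municipality ;
       ex:fiscalYear ?year ;
       ex:amount ?amount .
}
GROUP BY ?municipality ?year
ORDER BY ?municipality ?year
"""
```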
7.4 Deployment
Table 12 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.
Table 12 Components needed to deploy the Second SC6 Pilot

Module | Task | Responsible
Spark over HDFS, 4store, Flume, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, SWC
Data storage schema | To be extended for the pilot | SWC
Metadata extraction | Parsers for different data sources will be developed for the pilot | SWC
GraphSearch GUI | To be configured for the pilot | SWC
8 Second SC7 Pilot Deployment
8.1 Use cases
The pilot is carried out by SatCen, UoA and NCSR-D in the frame of SC7: Secure societies – Protecting freedom and security of Europe and its citizens.
The pilot demonstrates the following workflows:
1. Event detection workflow: news sites and social media are monitored and processed in order to extract and localize information about events. Events are categorized and the information from them is extracted; the end-user is notified about the area concerned by the news and can visualize the event information together with the changes detected by the other workflow (if activated).
2. Change detection workflow: the end user selects a relevant Area of Interest. With respect to the selected dates, two satellite images (earliest and latest) of these areas are downloaded from the ESA Sentinels Scientific Data Hub and processed in order to detect changes. The end-user is notified about detected changes and can view the images and event information about this area.
The second cycle of the SC7 pilot will extend the functionality and improve the performance of the first cycle of the pilot (cf. D5.2, Section 8).
Apart from the datasets used in the first cycle of the pilot, this cycle will also use the keyword-based Twitter API to retrieve tweets based on pre-defined keywords. To further support the keyword-based search, the second cycle of the pilot will also include a full-text indexing engine.
The following outputs are made available for visualization or further processing:
- Relevant news related to specific keywords, together with the corresponding Area of Interest.
- Detected changes.
Moreover, the event detection workflow will be extended in order to automatically activate the change detection workflow. These changes are depicted in the updated architecture diagram in Figure 7.
8.2 Requirements
Table 13 lists the ingestion, storage, processing and output requirements set by the second cycle of the pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 13 lists only the new requirements.
Table 13 Requirements of the Second SC7 Pilot

Requirement | Comment
R1: Monitor keyword-based text services (Twitter). Text is retrieved and stored together with provenance and any metadata provided by the service (notably location) | The NOMAD data connectors to Twitter and Reuters will be adapted to access the keyword search API of Twitter and store to Cassandra
R2: Regularly execute event detection using Spark over the most recent text batch | Event detection is part of the ingestion process and adds annotations to the text data; it is not part of the distributed processing
R3: Improve the speed of the change detection workflow | Optimize the scalability of the operators developed in Apache Spark for the change detection workflow
R4: Extend the change detection workflow to improve accuracy | Fundamental SNAP operators (e.g. Subset and Terrain Correction) for Sentinel-1 will be adapted to Apache Spark
R5: Areas of Interest are automatically defined by event detection | The Sentinel data connector is parametrized from the event detection module with a GIS shape
R6: End-user interface is based on Sextant | Improvement of Sextant functionalities to improve the user experience
R7: Users must be authenticated and authorized to access the pilot data | Sextant will be extended in order to support authentication and authorization
Figure 7 Architecture of the Second SC7 Pilot
8.3 Architecture
To satisfy the requirements above, the following modules will be deployed:
Storage infrastructures:
- HDFS for storing satellite images.
- Cassandra for storing news and tweets, content and metadata.
- Lucene for storing the GADM dataset, i.e. the administrative areas together with their geo-locations.
- Strabon for storing geo-locations of detected changes and location metadata about news and tweets.
Processing infrastructures:
- Spark will be made available for improving the change detection module and developing the event detection module.
Data integration:
- Semagrow will federate Strabon and Cassandra to provide the user interface with homogeneous access to both data stores.
Other modules:
- Twitter data connector.
- Reuters RSS feed reader.
- The Sentinel Data Aggregator, which receives as input the set of areas of interest and submits a suitable query to the Sentinels Scientific Data Hub.
- Sextant as the user interface.
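The keyword-based ingestion path (requirement R1) could be sketched as follows, assuming the DataStax Python driver for Cassandra. The Twitter call is reduced to a placeholder, since the pilot adapts the existing NOMAD connectors for this; the keyspace, table and keyword list are illustrative.

```python
# Sketch of keyword-based tweet ingestion into Cassandra.
from cassandra.cluster import Cluster

session = Cluster(["cassandra"]).connect("sc7")  # assumed keyspace
insert = session.prepare(
    "INSERT INTO tweets (id, keyword, text, created_at, location) "
    "VALUES (?, ?, ?, ?, ?)"
)

def fetch_tweets(keyword):
    """Placeholder for the adapted NOMAD/Twitter keyword-search connector."""
    return []  # each item: (id, text, created_at, location)

for keyword in ["flood", "earthquake", "explosion"]:  # pre-defined keywords
    for tweet_id, text, created_at, location in fetch_tweets(keyword):
        session.execute(insert, (tweet_id, keyword, text, created_at, location))
```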
8.4 Deployment
Table 14 lists the components provided to the pilot as part of BDI (https://github.com/big-data-europe/README/wiki/Components) and components that will be developed within WP6 in the context of executing the pilot.
Table 14 Components needed to deploy the Second SC7 Pilot

Module | Task | Responsible
Big Data Integrator: HDFS/Hadoop, Cassandra, Spark, Semagrow, Strabon, SOLR | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, UoA, SWC
Cassandra and Strabon stores | The schema needs to be altered to support tweets by keyword | NCSR-D and UoA
Change detection module | Spark code to be developed for extending and improving the change detection algorithm | UoA
Event Detection module | Spark code to be developed to scale the event detection algorithm | NCSR-D
Twitter data connector | To be extended to access the keyword search Twitter API | NCSR-D
User interface | To be enhanced for the pilot | UoA
9 Conclusions
This report analysed the pilot requirements and specified the components of the generic Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting round. The relevant work in this task is to ensure that the components are within the scope of what is prepared in WP4, and that they interoperate and can be used in the same application.
All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure and provided to the piloting partners as a basis for their piloting applications, which will be developed in WP6. As a result of this preliminary testing and the interaction between the technical partners and the piloting partners, some of the original pilot descriptions have been refined and fully specified, and their usage of BDI components has been clarified. This ensures that the pilot descriptions are consistent with the first public release of the BDI platform (D4.2) and can be reproduced by interested third parties.
Work in this task (Task 5.2) will proceed as follows:
- During the second pilot deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.
- During the third pilot deployment phase, work in this task will prepare the next version of this document, regarding the BDI instances needed for the third piloting round.
D54 ndash v 100
Page
4
Executive Summary
This report documents the instantiations of the Big Data Integrator Platform that underlies the
pilot applications that will be prepared in WP6 for serving exemplary use cases of the Horizon
2020 Societal Challenges These platform instances will be provided to the relevant networking
partners to be used for executing the pilot sessions foreseen in WP6
For each of the seven pilots this document provides (a) a brief summary of the pilot description
prepared in WP6 and especially of the use cases provided in the pilot descriptions (b) the
technical requirements for carrying out these use cases (c) an architecture that shows the BDI
components required to cover these requirements and (d) the list of components in the
architecture and their status (available as part of BDI or otherwise available or to be
developed as part of the pilot)
D54 ndash v 100
Page
5
Abbreviations and Acronyms
BDI
The Big Data Integrator platform that is developed within Big Data Europe
The components that are made available to the pilots by BDI are listed
here httpsgithubcombig-data-europeREADMEwikiComponents
BDI
Instance
A specific deployment of BDI complemented by tools specifically
supporting a given Big Data Europe pilot
BT Bluetooth
ECMWF European Centre for Medium range Weather Forecasting
ESGF Earth System Grid Federation
FCD Floating Car Data
LOD Linked Open Data
SC1 Societal Challenge 1 Health Demographic Change and Wellbeing
SC2 Societal Challenge 2 Food Security Sustainable Agriculture and Forestry
Marine Maritime and Inland Water Research and the Bioeconomy
SC3 Societal Challenge 3 Secure Clean and Efficient Energy
SC4 Societal Challenge 4 Smart Green and Integrated Transport
SC5 Societal Challenge 5 Climate Action Environment Resource Efficiency
and Raw Materials
SC6 Societal Challenge 6 Europe in a changing world ndash Inclusive innovative
and reflective societies
SC7 Societal Challenge 7 Secure societies ndash Protecting freedom and security
of Europe and its citizens
AK Agroknow Belgium
CERTH Centre for Research and Technology Greece
CRES Center for Renewable Energy Sources and Saving Greece
FAO Food and Agriculture Organization of the United Nations Italy
FhG Fraunhofer IAIS Germany
InfAI Institute for Applied Informatics Germany
NCSR-D National Center for Scientific Research ldquoDemokritosrdquo Greece
OPF Open PHACTS Foundation UK
SWC Semantic Web Company Austria
UoA National and Kapodistrian University of Athens
VU Vrije Universiteit Amsterdam the Netherlands
D54 ndash v 100
Page
6
Table of Contents 1 Introduction 9
11 Purpose and Scope 9
12 Methodology 9
2 Second SC1 Pilot Deployment 10
21 Use Cases 10
22 Requirements 10
23 Architecture 12
24 Deployment 12
3 Second SC2 Pilot Deployment 14
31 Overview 14
32 Requirements 15
33 Architecture 17
34 Deployment 18
4 Second SC3 Pilot Deployment 20
41 Overview 20
42 Requirements 21
43 Architecture 22
44 Deployment 23
5 Second SC4 Pilot Deployment 24
51 Use cases 24
52 Requirements 24
53 Architecture 26
54 Deployment 27
6 Second SC5 Pilot Deployment 28
61 Use cases 28
62 Requirements 29
63 Architecture 30
64 Deployment 31
7 Second SC6 Pilot Deployment 32
D54 ndash v 100
Page
7
71 Use cases 32
72 Requirements 33
73 Architecture 34
74 Deployment 35
8 Second SC7 Pilot Deployment 36
81 Use cases 36
82 Requirements 37
83 Architecture 38
84 Deployment 39
9 Conclusions 41
List of Tables
Table 1 Requirements of the Second SC1 Pilot 11
Table 2 Components needed to Deploy Second SC1 Pilot 13
Table 3 Requirements of the Second SC2 Pilot 16
Table 4 Components needed to deploy the Second SC2 Pilot 19
Table 5 Requirements of the Second SC3 Pilot 21
Table 6 Components needed to deploy the Second SC3 Pilot 23
Table 7 Requirements of the Second SC4 Pilot 25
Table 8 Components needed to deploy the Second SC4 Pilot 28
Table 9 Requirements of the Second SC5 Pilot 29
Table 10 Components needed to deploy the Second SC5 Pilot 31
Table 11 Requirements of the Second SC6 Pilot 33
Table 12 Components needed to deploy the Second SC6 Pilot 36
Table 13 Requirements of the Second SC7 Pilot 38
Table 14 Components needed to deploy the Second SC7 Pilot 40
D54 ndash v 100
Page
8
List of Figures
Figure 1 Architecture of the Second SC1 Pilot 12
Figure 2 Architecture of the Second SC2 Pilot 17
Figure 3 Architecture of the Second SC3 Pilot 22
Figure 4 Architecture of the Second SC4 Pilot 26
Figure 5 Architecture of the Second SC5 Pilot 30
Figure 6 Architecture of the Second SC6 Pilot 34
Figure 7 Architecture of the Second SC7 Pilot 38
D54 ndash v 100
Page
9
1 Introduction
11 Purpose and Scope
This report documents the instantiations of the Big Data Integrator Platform (BDI) for serving
the needs of the domains examined within Big Data Europe These platform instances will be
provided to the relevant networking partners to execute the pilots foreseen in WP6
12 Methodology
Task 52 focuses on the application of the generic Instantiation methodology in a specific Use
Case pertaining to domains closely related to Europersquos Social challenges To this end T52
comprises seven (7) distinct sub-tasks each one dedicated to a different domain of application
Participating partners and their role NCSR-D (task leader) deploys the different instantiations
of the Big Data Integrator Platform and supports the partners carrying out each pilot with
consulting about the platform This task includes two phases the design and the deployment
phase The design phase involves the following
Review the pilot descriptions prepared in WP6 and request clarifications where needed
in order to prepare a detailed technical description of the platform that will support the
pilot
Prepare a first draft of the sections for the second cycle pilots where use cases and
workflow from the pilot descriptions are summarized and technical requirements and
an architecture for each pilot-specific platform is drafted
Cooperate with the persons responsible for each pilot to update the pilot description
and the technical description in this deliverable so that they are consistent and
satisfactory This draft also includes a list of components and their availability (a) base
platform components that are prepared in WP4 (b) pilot-specific components that are
already available or (c) pilot-specific components that will be developed for the pilot
Components are also assigned a partner responsible for their implementation
Review the pilot technical descriptions from the perspective of bridging between
technical work and the community requirements to establish that the pilot is relevant
to the communities it is aimed at
During deployment phase work in this task will follow and document development of the
individual components and test their integration into the platform
D54 ndash v 100
Page
10
2 Second SC1 Pilot Deployment
21 Use Cases
The pilot is carried out by OPF and VU in the frame of SC1 Health Demographic Change and
Wellbeing
The pilot demonstrates the workflow of reproducing the functionality of an existing data
integration and processing system (the Open PHACTS Discovery Platform) on BDI The
second pilot extends the first pilot (cf D52 Section 2) with the following
Discussions with stakeholders and other Societal Challenges will identify how the
existing Open PHACTS platform and datasets may potentially be used to answer
queries in other domains In particular applications in Societal Challenge 2 (food
security and sustainable agriculture) where the effects of chemistry (eg pesticides)
on biology are probed in plants could exploit the linked data services currently within
the OPF platform This will require discussing use case specifics with SC2 to
understand their requirements and ensure that the OPF data is applicable Similarly
we will explore whether SC2 data could be linked to the OPF data platform is relevant
for early biology research
No specific new datasets are targeted for integration in the second pilot However if
datasets to be made available through other pilots have clear potential links to Open
PHACTS datasets these will be considered for integration into the platform to offer
researchers the ability to pose more complex queries across a wider range of data
The second pilot will aim to expand on first pilot by refreshing the datasets integrated
into the pilot Homogenising and integrating the new data available for these datasets
and developing ways to update datasets by integrating new data on an ongoing basis
will enable new use cases where researchers require fully current datasets for their
queries
The second pilot will also simplify existing workflows for querying the API for example
with components for common software tools such as KNIME reducing the barrier for
academic institutions and companies to access the platform for knowledge- and data-
driven biomedical research use cases
22 Requirements
Table 1 lists the ingestion storage processing and output requirements set by this pilot
Table 1 Requirements of the Second SC1 Pilot
D54 ndash v 100
Page
11
Requirement Comment
R1 The solution should be
packaged in a way such that it is
possible to combine the Open
PHACTS Docker and the BDE
platform to achieve a custom
integrated solution
Specificities of the services of the Open PHACTS
Discovery Platform should not be hard-wired into
the domain-specific instance but should be read
from a configuration file (such as SWAGGER)
The BDE instance should offer or apply these
external services over data hosted by the BDE
instance
R2 RDF data storage The current Open PHACTS Discovery Platform is
based on distributed Virtuoso a proprietary
solution The BDE platform will provide a
distributed 4store and SANSA to be compared
with the Open PHACTS Discovery Platform
R3 Datasets are aligned and linked
at data ingestion time and the
transformed data is stored
In conjunction with R1 a modular data ingestion
component should dynamically decide which data
transformers to invoke
R4 Data and query security and
privacy requirements
A BDI local deployment holds private data and
serves private queries BDE does not foresee any
specific technical support for query obfuscation
so remote data sources need to be cloned locally
to guarantee query privacy
Table 1 Requirements of the Second SC1 Pilot
D54 ndash v 100
Page
12
Figure 1 Architecture of the Second SC1 Pilot
Figure 1 Architecture of the Second SC1 pilot
23 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
Distributed triple store for the data The second pilot cycle will also test the feasibility of
using SANSA stack1 as an alternative of SPARQL query processing
Processing infrastructures
Scientific Lenses query expansion
Other modules
Data connector including the data transformation modules for the alignment of data at
ingestion time
REST API for querying that builds a SPARQL query by using keywords to fill in pre-
defined query templates The querying services also uses Scientific Lenses to expand
queries
24 Deployment
Table 2 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
1 httpsansa-stacknet
D54 ndash v 100
Page
13
Table 2 Components needed to Deploy Second SC1 Pilot
Module Task Responsible
4store BDI dockers made available by WP4 NCSR-D
SANSA stack BDI dockers made available by WP4 FhGUniBonn
Data connector and
transformation modules
Develop a dynamic transformation
engine that uses SWAGGER
descriptions to select the appropriate
transformer
VU
Query endpoint Develop a dynamic query re-write
engine that uses SWAGGER
descriptions to select the transformer
VU
Scientific Lenses query
expansion module
Needs to be deployed and tested
unless an existing live service will be
used for the BDE pilot
VU
Table 2 Components needed to Deploy Second SC1 Pilot
D54 ndash v 100
Page
14
3 Second SC2 Pilot Deployment
31 Overview
The pilot is carried out by AK FAO and SWC in the frame of SC2 Food Security Sustainable
Agriculture and Forestry Marine Maritime and Inland Water Research and the Bioeconomy
The second pilot cycle builds upon the first pilot cycle (cf D51 Section 3) expanding the
relevant data sources and extending the data processing needed to handle a variety of data
types (apart from bibliographic data) relevant to Viticulture
The pilot demonstrates the following workflows
1 Text mining workflow Automatically annotating scientific publications by (a) extracting
named entities (locations domain terms) and (b) extracting the captions of images
figures and tables The extracted information is provided to viticultural researchers via
a GUI that exposes search functionality
2 Data processing workflow The end users (viticultural researchers) upload scientific
data in a variety of formats and provide the metadata needed in order to correctly
interpret the data The data is ingested and homogenized so that it can be compared
and connected with other relevant data originally in diverse formats The data is
exposed to viticultural researchers via a GUI that exposes searchdiscovery
aggregation analysis correlation and visualization functionalities over structured data
The results of the data analysis will be stored in the infrastructure to avoid carrying out
the same processing multiple times with appropriate provence for future reference
publication and scientific replication
3 Phenologic modeling workflow that is the scheduling of agricultural operations (eg
pruning harvesting etc) by cross-examining the weather data observed in the area of
the vineyard with the appropriate weather conditions needed for the aforementioned
operations
4 Variety identification workflow The end users complete an on-spot questionnaire
regarding the characteristics of a specific grape variety Together with the geolocation
of the questionnaire this information is used to identify a grape variety
The following datasets will be involved
The AGRIS and PubMed datasets that include scientific publications
Weather data available via publicly-available API such as AccuWeather
OpenWeatherMap Weather Underground
D54 ndash v 100
Page
15
User-generated data such as geotagged photos from leaves young shoots and grape
clusters ampelographic data SSR-marker data that will be provided by the VITIS
application
OIV Descriptor List2 for Grape Varieties and Vitis species
Crop Ontology
The following processing is carried out
Named entity extraction
Researcher affiliation extraction and verification
Variety identification
Phenologic modelling
PDF structure processing to associate tables and diagrams with captions
Data ingestion and homogenization
Aggregation analysis correlation over scientific data
The following outputs are made available for visualization or further processing
Structured information topics extracted from scientific publications
Metadata for dataset searching and discovery
Aggregation analysis correlation results
32 Requirements
Table 3 lists the ingestion storage processing and output requirements set by this pilot
Table 3 Requirements of the Second SC2 Pilot
Requirement Comment
R1 In case of failures during data
processing users should be able to
recover data produced earlier in the
workflow along with their lineage and
associated metadata
Processing modules should periodically
store intermediate results and their lineage
metadata When starting up processing
modules should check at the metadata
registry if intermediate results are available
R2 Extracting images and their captions
from scientific publications
To be developed for the pilot taking into
account R1
2 httpwwwoivinten
D54 ndash v 100
Page
16
R3 Extracting thematic annotations from
text in scientific publications
To be developed for the pilot taking into
account R1
R4 Extracting researcher affiliations from
the scientific publications
To be developed for the pilot taking into
account R1
R5 Variety identification To be developed for the pilot taking into
account R1
R6 Phenolic modeling To be developed for the pilot taking into
account R1
R5 Expose data and metadata in JSON
through a Web API
Data ingestion module should write JSON
documents in HDFS 4store should be
accessed via a SPARQL endpoint that
responds with results in JSON
Table 3 Requirements of the Second SC2 Pilot
D54 ndash v 100
Page
17
Figure 2 Architecture of the Second SC2 Pilot
Figure 2 Architecture of the Second SC2 Pilot
33 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS for storing publication full-text and ingested datasets
A graph database for storing publication metadata (terms and named entities)
affiliation metadata (connections between researchers) weather metadata and VITIS
metadata
Processing infrastructures
Metadata extraction Spark or UnifiedViews3 are used to extract RDF metadata from
publication full-text These tools will react on Kafka messages Spark and UnifiedViews
will be evaluated for this task
3 Cf httpwwwunifiedviewseu
D54 ndash v 100
Page
18
PoolParty A SKOS Thesaurus4 will be used to consolidatetranslate (linkmap) the
terms in the ingested documents (eg bio terms locations and other named entities)
For this step the SWC PoolParty Semantic Suite5 will be used Additional enrichment
of the dataset will be explored eg via linking to DBpedia or other LOD sources
AKSTEM the process of discovering relations and associations between organizations
and people in the field of viticulture research
Phenolic Modeling algorithm already developed in AK VITIS will be adapted to work in
the context of an Apache Spark application
Variety Identification already developed in AK VITIS will be adapted to work in the
context of an Apache Spark application
Extraction of images and figures and their captions from publication PDFs
Data analysis which writes analysis results back into the infrastructure to be retrieved
for visualization Data analysis should accompany each write-back with appropriate
metadata that specify the processing lineage of the derived dataset Intermediate
results should also be written out (and described as such in the metadata) in order to
allow resuming processing after a failure
Other modules
Flume for publication ingestion For every source that will be ingested into the system
there will be a flume agent responsible for data ingestion and basic
modificationunification
Kafka as soon as a new record is available a Kafka message will be produced One
kafka consumer stores raw data into HDFS
34 Deployment
Table 4 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 4 Components needed to deploy the Second SC2 Pilot
Module Task Responsible
Spark over HDFS Flume
Kafka
BDI dockers made available by WP4 FH TF InfAI
SWC
4 Cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 5 Cf httpwwwpoolpartybiz
D54 ndash v 100
Page
19
GraphDB andor Neo4j
dockerization
To be investigated if the Docker
images provided by the official
systems6 are suitable for the pilot If
not will be altered for the pilot or use
an already dockerized triple store such
as Virtuoso or 4store
SWC
Flume agents for publication
ingestion and processing
To be developed for the pilot SWC
Flume agents for data
ingestion
To be extended for the pilot in order to
support the introduced datasets
(accuweather data user-generated
data)
SWC AK
Data storage schema To be developed for the pilot SWC AK
Phenolic modelling To be adapted from AK VITIS for the
pilot
AK
Spark AKSTEM To be adapted from AK STEM for the
pilot
AK
Variety Identification To be adapted from AK VITIS for the
pilot
AK
Table 4 Components needed to deploy the Second SC2 Pilot
6 httpsneo4jcomdeveloperdocker
D54 ndash v 100
Page
20
4 Second SC3 Pilot Deployment
41 Overview
The pilot is carried out by CRES in the frame of SC3 Secure Clean and Efficient Energy
The second pilot cycle extends the first pilot by adding additional online and offline data
analysis on raw data regarding Acoustic Emissions (AE) sensors and aggregated data such
as parametrics from continuous monitoring systems (CMS) The pilot demonstrates the
following workflow a developer in the field of wind energy enhances condition monitoring for
each unit in a wind farm by pooling together data from multiple units from the same farm (to
consider the cluster operation in total) and third party data (to perform correlated assessment)
The custom analysis modules created by the developer use both raw data that are transferred
offline to the processing cluster and condensed data streamed online at the same time order
that the event occurs
The following datasets are involved
Raw sensor and SCADA data from a given wind farm
Online stream data comprised of parametrics and statistics extracted from the raw
SCADA data
Raw sensor data from Acoustic Emissions module from a given wind farm
All data is in custom binary or ASCII formats ASCII files contain a metadata header and in
tabulated form the signal data (signal in columns time sequence in rows) All data is annotated
by location time and system id
The following processing is carried out
Near-real time execution of parametrized models to return operational statistics
warnings including correlation analysis of data across units
Weekly execution of operational statistics
Weekly execution of model parametrization
Weekly specific acoustic emissions DSP
The following outputs are made available for visualization or further processing
Operational statistics near-real time and weekly
Model parameters
D54 ndash v 100
Page
21
42 Requirements
Table 5 lists the ingestion storage processing and output requirements set by this pilot Since
the second cycle of the pilot extends the first pilot some requirements are identical and
therefore omitted from Table 5
Table 5 Requirements of Second SC3 Pilot
Requirement Comment
R1 The online data will be sent (via
OPC) from the intermediate
(local) processing level to BDI
A data connector must be developed that provides
for receiving OPC streams from an OPC-
compatible server
R2 The application should be able
to recover from short outages by
collecting the data transmitted
during the outage from the data
sources
An OPC data connector must be developed that
can retrieve the missing data collected at the
intermediate level from the distributed data
historian systems
R3 Near-realtime execution of
parametrized models to return
operational statistics including
correlation analysis of data
across units
The analysis software should write its results back
into a specified format and data model that is
appropriate input for further analysis
R4 The GUI supports database
querying and data visualization
for the analytics results
The GUI will be able to access files in the format
and data model
Table 5 Requirements of the Second SC3 Pilot
D54 ndash v 100
Page
22
Figure 3 Architecture of the Second SC3 Pilot
Figure 3 Architecture of the Second SC3 Pilot
4.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures:
- HDFS, which stores binary blobs, each holding a temporal slice of the complete data. The slicing parameters are fixed and can be applied at data ingestion time.
- A Postgres relational database to store the warnings, operational statistics, and the output of the analysis. The schema will be defined at a later stage.
- A Kafka broker that will distribute the continuous stream of CMS data to model execution.

Processing infrastructures:
- A processor that operates upon temporal slices of data
- A Spark module that orchestrates the application of the processor on slices
- A Spark streaming module that operates on the online data (a minimal sketch of this streaming path is given after this list)

Other modules:
- A data connector that offers an ingestion endpoint and/or can retrieve from remote data sources using the FTP protocol
- A data connector that offers an ingestion endpoint that can retrieve an online stream using the OPC protocol and publish it to a Kafka topic
- Data visualization that can visualize the data files stored in HDFS
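The sketch below illustrates the online path: a Spark Structured Streaming job consumes the condensed CMS stream from the Kafka broker and computes windowed operational statistics per unit. The topic name, message layout, and window sizes are assumptions for illustration; in the pilot, results would be written to Postgres rather than to the console.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cms-online-stats").getOrCreate()

# Consume the condensed CMS parametrics published by the OPC data connector
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "kafka:9092")
       .option("subscribe", "cms-parametrics")          # hypothetical topic name
       .load())

# Assume each message value is CSV: "unit_id,timestamp,value"
parsed = raw.selectExpr("CAST(value AS STRING) AS csv").selectExpr(
    "split(csv, ',')[0] AS unit_id",
    "CAST(split(csv, ',')[1] AS TIMESTAMP) AS ts",
    "CAST(split(csv, ',')[2] AS DOUBLE) AS value")

# Near-real-time operational statistics over 5-minute windows per unit
stats = (parsed.withWatermark("ts", "10 minutes")
         .groupBy(F.window("ts", "5 minutes"), "unit_id")
         .agg(F.avg("value").alias("mean"), F.stddev("value").alias("std")))

stats.writeStream.outputMode("append").format("console").start().awaitTermination()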
4.4 Deployment

Table 6 lists the components provided to the pilot as part of BDI and the components that will be developed within WP6 in the context of executing the pilot.

Table 6: Components needed to deploy the Second SC3 Pilot

- Spark, HDFS, Postgres, Kafka: BDI dockers made available by WP4 (responsible: FH, TF, InfAI, NCSR-D, SWC)
- Acoustic Emissions DSP: to be developed for the pilot (responsible: CRES)
- OPC data connector: to be developed for the pilot (responsible: CRES)
- Data visualization: to be extended for the pilot (responsible: CRES)
5 Second SC4 Pilot Deployment

5.1 Use cases

The pilot is carried out by FhG and CERTH in the frame of SC4: Smart, Green and Integrated Transport.

The pilot demonstrates how to implement the workflow for ingesting, processing, and storing stream and historical traffic data in a distributed environment. The pilot demonstrates the following workflows:

- The map matching of the Floating Car Data (FCD) stream that is generated by the taxi fleet. The FCD data, which represents the position of cabs using latitude and longitude coordinates, must be map matched to the roads on which the cabs are driving in order to infer the traffic conditions of the roads. The map matching is done through an algorithm using a geographical database and topological rules.
- The monitoring of the current traffic conditions, which consumes the map matched FCD data and infers the traffic conditions of the roads.
- The forecasting of future traffic conditions, based on a model that is trained from historical and real-time map matched FCD data.

The second pilot is based upon the processing modules developed in the first pilot (cf. D5.2 Section 5), namely the processing modules developed by CERTH to analyze traffic data and classify traffic conditions. The second pilot will also develop the newly added workflow of traffic forecasting and model training, which did not exist during the first pilot cycle.

The data sources available for the pilot are:
- A near-real-time stream of Floating Car Data (FCD) generated by a fleet of 1200 taxis, containing information about the position, speed, and direction of the cabs
- A historical database of recorded FCD data
- A geographical database with information about the road network in Thessaloniki

The results of traffic monitoring and traffic forecasting are saved into a database for querying, statistics, and visualizations.

5.2 Requirements

Table 7 lists the ingestion, storage, processing, and output requirements set by this pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 7 lists only the new requirements.
Table 7: Requirements of the Second SC4 Pilot

R1. The pilot will enable the evaluation of present and future traffic conditions (e.g., congestion) within temporal windows.
Comment: The map matched FCD data are used to determine the current traffic condition and to make predictions within different time windows.

R2. The traffic predictions will be saved in a database.
Comment: Traffic conditions and predictions will be used for queries, statistics, evaluation of the quality of predictions, and visualizations.

R3. The pilot can be started in two configurations: single node (for development and testing) and cluster (production).
Comment: It must be possible to run all the pilot components in one single node for development and testing purposes. The cluster configuration must provide a cluster of any of the components: messaging system (Kafka), processing modules (Flink, Spark, TensorFlow), storage (Postgres).
Figure 4: Architecture of the Second SC4 Pilot
5.3 Architecture

The architecture of the pilot has been designed taking into consideration the data sources (mostly streams), the processing steps needed, and the information that needs to be computed. The pilot will ingest data from a near-real-time FCD data stream from cabs and from historical FCD data. The FCD data needs to be preprocessed for map matching before being used for classification/prediction.

Apache Kafka will be used to distribute the computations, as it provides a scalable, fault-tolerant messaging system. The processing of the data streams will be performed within temporal windows. Apache Flink will be used for the map matching algorithm, in the same manner as in the first cycle of the pilot. Apache Spark or TensorFlow will be considered as a platform to implement the traffic forecasting algorithm.
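For illustration, the following sketch publishes one FCD record to a Kafka topic with the kafka-python client, corresponding to the producer described in Section 5.4. The topic name and the record fields are assumptions, not the actual FCD schema.

import json
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"))

# One hypothetical FCD message: taxi position, speed, and heading
fcd_record = {
    "taxi_id": 42,
    "lat": 40.6401, "lon": 22.9444,           # somewhere in Thessaloniki
    "speed_kmh": 35.0, "heading_deg": 270.0,
    "ts": int(time.time() * 1000)}

producer.send("fcd-raw", fcd_record)          # consumed by the Flink map matching job
producer.flush()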
The algorithms used for the map matching and classification will be provided using R, as it provides good support for machine learning algorithms and because it is commonly used and well known by researchers at CERTH. In order to use the R packages in a Flink application developed in Java, the pilot will connect to an R server (via Rserve). Recurrent Neural Networks will be used for the traffic forecasting module.

The traffic conditions and prediction computations will be stored in a scalable, fault-tolerant database such as Elasticsearch. The storage system must support spatial and temporal indexing.
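The sketch below shows what the spatial and temporal indexing could look like in Elasticsearch, using the official Python client: a geo_point field for the road location and a date field for the prediction validity time. The index name, field names, and example document are assumptions.

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://elasticsearch:9200"])

# geo_point and date types enable the spatial and temporal queries the pilot needs
es.indices.create(index="traffic-predictions", body={
    "mappings": {"properties": {
        "road_id": {"type": "keyword"},
        "location": {"type": "geo_point"},
        "valid_from": {"type": "date"},
        "horizon_min": {"type": "integer"},
        "congestion_level": {"type": "keyword"}}}})

es.index(index="traffic-predictions", body={
    "road_id": "edge-1234",
    "location": {"lat": 40.64, "lon": 22.94},
    "valid_from": "2017-02-24T10:00:00Z",
    "horizon_min": 15,
    "congestion_level": "heavy"})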
5.4 Deployment

Table 8 lists the components provided to the pilot as part of BDI and the components that will be developed within WP6 in the context of executing the pilot.

Table 8: Components needed to deploy the Second SC4 Pilot

- PostGIS, Elasticsearch, Kafka, Flink, Spark, TensorFlow: BDI dockers made available by WP4 (responsible: NCSR-D, SWC, TF, FhG)
- A Kafka producer for the FCD data stream (source: URL) and historical data (source: file system): develop a Kafka producer to collect the FCD data as a stream from web services and from the file system for the historical data sets, and send them to a Kafka topic (responsible: FhG)
- Kafka brokers: install Kafka to provide a message broker and the topics (responsible: SWC)
- A Spark application for traffic forecasting and model training: develop a Spark application that consumes map matched FCD data from a Kafka topic; the application will train a prediction model and write the traffic predictions to Elasticsearch (responsible: FhG)
- A Kafka consumer for storing analysis results: develop a Kafka consumer that stores the results of the traffic classification and prediction module (responsible: FhG)
6 Second SC5 Pilot Deployment

6.1 Use cases

The pilot is carried out by NCSR-D in the frame of SC5: Climate Action, Environment, Resource Efficiency and Raw Materials.

The pilot demonstrates the following workflow. A (potentially hazardous) substance is released in the atmosphere, resulting in increased readings at one or more monitoring stations. The user accesses a user interface provided by the pilot to define the locations of the monitoring stations as well as a time series of the measured values (e.g., gamma dose rate). The platform then initiates:

- a weather matching algorithm, that is, a search for similarity between the current weather and the pre-computed weather patterns, as well as
- a dispersion matching algorithm, that is, a search for similarity between the current substance dispersion patterns and the precomputed ones.

The weather patterns have been extracted in a pre-processing step by clustering weather conditions recorded in the past, while the substance dispersion patterns have been precomputed by simulating different scenarios of substance release and weather conditions. The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request.
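A minimal sketch of the clustering and matching steps with scikit-learn follows, assuming the weather snapshots have already been flattened into numeric feature vectors; the file names, feature layout, and number of clusters are all assumptions.

import numpy as np
from sklearn.cluster import KMeans

# One row per historical weather snapshot, flattened into a feature vector
history = np.load("weather_features.npy")       # hypothetical pre-extracted features

# Pre-processing step: cluster past weather conditions into patterns
kmeans = KMeans(n_clusters=50, random_state=0).fit(history)

# Weather matching: assign the current weather to its closest pre-computed pattern,
# whose pre-computed dispersions are then retrieved from storage
current = np.load("current_weather.npy").reshape(1, -1)
pattern_id = int(kmeans.predict(current)[0])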
The following datasets are involved:
- NetCDF files from the European Centre for Medium range Weather Forecasting (ECMWF, http://apps.ecmwf.int/datasets)
- GRIB files from the National Oceanic and Atmospheric Administration (NOAA, https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/global-forcast-system-gfs)

The following processing will be carried out:
- The weather clustering algorithm, which creates clusters of similar weather conditions, implemented using the BDI platform (see Section 6.3)
- The WRF downscaling, which takes as input a low-resolution weather field and creates a high-resolution one
- The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model, which computes dispersion patterns given predominant weather conditions

The following outputs are made available for visualization or further processing:
- The dispersions produced by DIPCOT
- The weather clusters produced by the weather clustering algorithm
6.2 Requirements

Table 9 lists the ingestion, storage, processing, and output requirements set by this pilot.

Table 9: Requirements of the Second SC5 Pilot

R1. Provide a means of downloading current/evaluation weather from ECMWF or alternative services.
Comment: A data connector/interface needs to be developed.

R2. ECMWF and NOAA datasets are compatible with the WRF and DIPCOT naming conventions.
Comment: A preprocessing WPS normalization step will perform the transformations and variable renamings needed to ensure compatibility.

R3. Retrieve NetCDF files from HDFS as input to the weather clustering algorithm.

R4. Dispersion matching will filter on dispersion values.
Comment: A relational database will provide indexes on dispersion values for efficient dispersion search (see the sketch after this table).

R5. Dispersion visualization.
Comment: Weather and dispersion matching must produce output compatible with Sextant's input, or Sextant must be modified to support the new input.
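As a sketch of requirement R4, the snippet below indexes the stored dispersion values in Postgres and filters the pre-computed patterns whose simulated value at a station lies within a tolerance band around an observed reading. The table schema, connection string, and tolerance are assumptions.

import psycopg2

conn = psycopg2.connect("dbname=dispersions user=bde")   # hypothetical DSN
cur = conn.cursor()

# Hypothetical schema: one row per (pattern_id, station_id) with the simulated value
cur.execute("CREATE INDEX IF NOT EXISTS dispersion_value_idx "
            "ON dispersion (station_id, value)")

observed = 1.2e-6   # e.g. an observed gamma dose rate
cur.execute(
    "SELECT DISTINCT pattern_id FROM dispersion "
    "WHERE station_id = %s AND value BETWEEN %s AND %s",
    ("station-3", 0.8 * observed, 1.2 * observed))
matching_patterns = [row[0] for row in cur.fetchall()]
conn.commit()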
Figure 5: Architecture of the Second SC5 Pilot
6.3 Architecture

To satisfy the requirements described above, the following components will be deployed.

Storage infrastructure:
- HDFS for storing NetCDF and GRIB files
- Postgres for storing dispersions

Processing components:
- Scikit-learn or TensorFlow to host the weather clustering algorithm

Other modules:
- ECMWF and NOAA data connectors
- WPS normalization procedure
- WRF downscaling component
- DIPCOT atmospheric dispersion model
- Weather and dispersion matching
- Sextant for visualizing the dispersion layer
6.4 Deployment

Table 10 lists the components provided to the pilot as part of BDI and the components that will be developed within WP6 in the context of executing the pilot.

Table 10: Components needed to deploy the Second SC5 Pilot

- HDFS, Sextant, Postgres: BDI dockers made available by WP4 (responsible: TF, UoA, NCSR-D)
- Scikit-learn / TensorFlow: to be developed in the pilot (responsible: NCSR-D)
- DIPCOT: to be packaged in the pilot (responsible: NCSR-D)
- Weather clustering algorithm: to be developed in the pilot (responsible: NCSR-D)
- Weather matching: to be developed in the pilot (responsible: NCSR-D)
- Dispersion matching: to be developed in the pilot (responsible: NCSR-D)
- ECMWF and NOAA data connector: to be developed in the pilot (responsible: NCSR-D)
- Data visualization UI: to be developed in the pilot (responsible: NCSR-D)
7 Second SC6 Pilot Deployment

7.1 Use cases

The pilot is carried out by NCSR-D and SWC in the frame of SC6: Europe in a changing world - inclusive, innovative and reflective societies.

The pilot demonstrates the following workflow. Municipality economic data (i.e., budget and budget execution data) are ingested on a regular basis (daily, weekly, and so on) from a series of locations, in a variety of structures and formats, and are homogenized so that they can be compared, analyzed, and visualized in a comprehensible way. The data is exposed to users via a dashboard that exposes search/discovery, aggregation, analysis, correlation, and visualization functionalities over structured data. The results of the data analysis will be stored in the infrastructure to avoid carrying out the same processing multiple times.

The second cycle of the pilot will extend the first pilot by incorporating different formats, through the development of a modular parsing library.

The following datasets are involved:
- Budget execution data of the Municipality of Athens
- Budget execution data of the Municipality of Thessaloniki
- Budget execution data of the Municipality of Barcelona

The current datasets involved are exposed either as an API or as CSV or XML files. Datasets will be described by DCAT-AP metadata (cf. https://joinup.ec.europa.eu/asset/dcat_application_profile/description) and the FIBO (cf. http://www.omg.org/spec/EDMC-FIBO/FND/1.0/Beta1/index.htm) and FIGI (cf. http://www.omg.org/hot-topics/finance.htm) ontologies. Statistical data will be described in the RDF Data Cube vocabulary (cf. https://www.w3.org/TR/2014/REC-vocab-data-cube-20140116/).
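For illustration, the sketch below builds one RDF Data Cube observation for a budget line with rdflib; the namespace, property names, and values are hypothetical, since the actual data model is defined by the pilot's storage schema.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

QB = Namespace("http://purl.org/linked-data/cube#")
EX = Namespace("http://example.org/budget/")     # hypothetical pilot namespace

g = Graph()
obs = EX["obs/athens-2016-123"]
g.add((obs, RDF.type, QB.Observation))
g.add((obs, QB.dataSet, EX["dataset/athens-budget-execution-2016"]))
g.add((obs, EX.budgetLine, EX["line/social-services"]))               # dimension
g.add((obs, EX.amount, Literal("1250000.00", datatype=XSD.decimal)))  # measure

print(g.serialize(format="turtle"))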
The following processing is carried out:
- Data ingestion and homogenization
- Aggregation, analysis, and correlation over the data

The following outputs are made available for visualization or further processing:
- Structured information extracted from budget datasets, exposed as a SPARQL endpoint
- Metadata for dataset searching and discovery
- Aggregation and analysis results
7.2 Requirements

Table 11 lists the ingestion, storage, processing, and output requirements set by this pilot.

Table 11: Requirements of the Second SC6 Pilot

R1. In case of failures during data processing, users should be able to recover data produced earlier in the workflow, along with their lineage and associated metadata.
Comment: Processing modules should periodically store intermediate results. When starting up, processing modules should check at the metadata registry if intermediate results are available.

R2. Transform budget data into a homogenized format using various parsers.
Comment: Parsers will be developed for the pilot, taking into account R1.

R3. Expose data and metadata through a SPARQL endpoint.
Comment: The triple store should be accessed via a SPARQL endpoint.

R4. Intuitive, easy-to-use interface for searching and selecting relevant data sources. The use of the user interface should be documented so that users can ease into using it with as little effort as possible.
Comment: The GraphSearch UI will be used to create visualizations from SPARQL queries.
Figure 6: Architecture of the Second SC6 Pilot
7.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures:
- HDFS for storing ingested datasets
- 4store for storing homogenized statistical data and dataset metadata

Processing infrastructures:
- Metadata extraction: Spark is used to extract RDF data and metadata from budget data. These tools will react on Kafka messages.
- PoolParty: a SKOS Thesaurus (cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System) will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g., terms, locations, and other named entities). For this step the SWC PoolParty Semantic Suite (cf. http://www.poolparty.biz) will be used as an external service. PoolParty is accessible from the BDE components via an HTTP API; the connection between Spark and PoolParty has been implemented in the first pilot cycle. Additional enrichment of the dataset will be explored, e.g., via linking to DBpedia or other LOD sources.
- Data analysis, which will be performed on demand by pre-defined queries in the dashboard.

Other modules:
- Flume for dataset ingestion: for every source that will be ingested into the system, there will be a Flume agent responsible for data ingestion and basic modification/unification.
- Kafka: as soon as a new record is available, a Kafka message is produced. One Kafka consumer stores raw data into HDFS.
- A set of pre-defined SPARQL queries that carry out analytical aggregations, important comparisons, and/or other analysis of the data (a minimal example is sketched after this list).
- A GUI that provides functionality for (a) metadata searching, to discover datasets, data, and publications, and (b) linked data browsing (i.e., dereferencing entity descriptions in RDF), in the form of a visual dashboard realised in d3.js (cf. https://d3js.org)
- GraphSearch as the user interface
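One of the pre-defined analytical queries could look like the sketch below, which sums the executed amount per budget line over the SPARQL endpoint. The endpoint URL is an assumption, and the property names reuse the hypothetical vocabulary of the Data Cube example in Section 7.1.

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://4store:8080/sparql/")   # hypothetical endpoint URL
sparql.setQuery("""
    PREFIX qb: <http://purl.org/linked-data/cube#>
    PREFIX ex: <http://example.org/budget/>
    SELECT ?line (SUM(?amount) AS ?total)
    WHERE {
      ?obs a qb:Observation ;
           ex:budgetLine ?line ;
           ex:amount ?amount .
    }
    GROUP BY ?line
    ORDER BY DESC(?total)
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["line"]["value"], row["total"]["value"])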
7.4 Deployment

Table 12 lists the components provided to the pilot as part of BDI and the components that will be developed within WP6 in the context of executing the pilot.

Table 12: Components needed to deploy the Second SC6 Pilot

- Spark over HDFS, 4store, Flume, Kafka: BDI dockers made available by WP4 (responsible: FH, TF, InfAI, NCSR-D, SWC)
- Data storage schema: to be extended for the pilot (responsible: SWC)
- Metadata extraction: parsers for the different data sources will be developed for the pilot (responsible: SWC)
- GraphSearch GUI: to be configured for the pilot (responsible: SWC)
8 Second SC7 Pilot Deployment

8.1 Use cases

The pilot is carried out by SatCen, UoA, and NCSR-D in the frame of SC7: Secure societies - protecting freedom and security of Europe and its citizens.

The pilot demonstrates the following workflows:

1. Event detection workflow: news sites and social media are monitored and processed in order to extract and localize information about events. Events are categorized and the relevant information is extracted; the end user is notified about the area that the news concerns and can visualize the event information together with the changes detected by the other workflow (if activated).
2. Change detection workflow: the end user selects a relevant Area of Interest. With respect to the selected dates, two satellite images (earliest and latest) of this area are downloaded from the ESA Sentinels Scientific Data Hub and processed in order to detect changes. The end user is notified about detected changes and can view the images and event information about this area.

The second cycle of the SC7 pilot will extend the functionality and improve the performance of the first cycle of the pilot (cf. D5.2 Section 8).

Apart from the datasets used in the first cycle of the pilot, this cycle will also use the keyword-based Twitter API to retrieve tweets based on pre-defined keywords. To further support the keyword-based search, the second cycle of the pilot will also include a full-text indexing engine.
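A minimal sketch of how the full-text engine could be used (here Solr via pysolr, matching the SOLR component listed in Table 14); the core name, document fields, and query are assumptions:

import pysolr

solr = pysolr.Solr("http://solr:8983/solr/tweets", timeout=10)   # hypothetical core

# Index a retrieved tweet for keyword-based full-text search
solr.add([{
    "id": "tweet-815731",
    "text": "Road closed near the port after flooding",
    "keyword": "flooding",
    "created_at": "2017-02-20T08:15:00Z"}])

# Retrieve documents matching a pre-defined keyword
for hit in solr.search("text:flooding", rows=10):
    print(hit["id"], hit["text"])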
The following outputs are made available for visualization or further processing:
- Relevant news related to specific keywords, together with the corresponding Area of Interest
- Detected changes

Moreover, the event detection workflow will be extended in order to automatically activate the change detection workflow. These changes are depicted in the updated architecture diagram in Figure 7.
8.2 Requirements

Table 13 lists the ingestion, storage, processing, and output requirements set by the second cycle of the pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 13 lists only the new requirements.

Table 13: Requirements of the Second SC7 Pilot

R1. Monitor keyword-based text services (Twitter). Text is retrieved and stored together with provenance and any metadata provided by the service (notably location).
Comment: The NOMAD data connectors to Twitter and Reuters will be adapted to access the keyword search API of Twitter and store to Cassandra (a minimal sketch follows the table).

R2. Regularly execute event detection using Spark over the most recent text batch.
Comment: Event detection is part of the ingestion process and adds annotations to the text data; it is not part of the distributed processing.

R3. Improve the speed of the change detection workflow.
Comment: Optimize the scalability of the operators developed in Apache Spark for the change detection workflow.

R4. Extend the change detection workflow to improve accuracy.
Comment: Fundamental SNAP operators (e.g., Subset and Terrain Correction) for Sentinel-1 will be adapted to Apache Spark.

R5. Areas of Interest are automatically defined by event detection.
Comment: The Sentinel data connector is parametrized from the event detection module with a GIS shape.

R6. The end-user interface is based on Sextant.
Comment: Improvement of Sextant functionalities to improve the user experience.

R7. Users must be authenticated and authorized to access the pilot data.
Comment: Sextant will be extended in order to support authentication and authorization.
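A sketch of the adapted connector behind R1, assuming the standard Twitter keyword search REST API of the time and a hypothetical Cassandra keyspace and table; the authentication token is a placeholder:

import requests
from cassandra.cluster import Cluster

# Keyword search against the Twitter REST API (bearer token is a placeholder)
resp = requests.get(
    "https://api.twitter.com/1.1/search/tweets.json",
    params={"q": "earthquake", "result_type": "recent"},
    headers={"Authorization": "Bearer <token>"})
tweets = resp.json().get("statuses", [])

cluster = Cluster(["cassandra"])
session = cluster.connect("sc7")                 # hypothetical keyspace
insert = session.prepare(
    "INSERT INTO tweets (id, keyword, text, created_at) VALUES (?, ?, ?, ?)")
for t in tweets:
    session.execute(insert,
                    (t["id_str"], "earthquake", t["text"], t["created_at"]))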
Figure 7: Architecture of the Second SC7 Pilot
8.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures:
- HDFS for storing satellite images
- Cassandra for storing news and tweets (content and metadata)
- Lucene for storing the GADM dataset, i.e., the administrative areas together with their geo-locations
- Strabon for storing geo-locations of detected changes and location metadata about news and tweets

Processing infrastructures:
- Spark will be made available for improving the change detection module and developing the event detection module

Data integration:
- Semagrow will federate Strabon and Cassandra to provide the user interface with homogeneous access to both data stores

Other modules:
- Twitter data connector
- Reuters RSS feed reader
- The Sentinel Data Aggregator, which receives as input the set of Areas of Interest and submits a suitable query to the Sentinels Scientific Data Hub (a minimal sketch is given after this list)
- Sextant as the user interface
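For illustration, the Sentinel Data Aggregator's query to the Sentinels Scientific Data Hub could be implemented along the lines of the sketch below, using the sentinelsat client. The credentials and the AoI file are placeholders; in practice the AoI polygon would be the GIS shape supplied by the event detection module (R5).

from sentinelsat import SentinelAPI, geojson_to_wkt, read_geojson

api = SentinelAPI("user", "password", "https://scihub.copernicus.eu/dhus")
footprint = geojson_to_wkt(read_geojson("aoi.geojson"))   # hypothetical AoI file

# Earliest and latest Sentinel-1 products over the area within the selected period
products = api.query(footprint,
                     date=("20170101", "20170220"),
                     platformname="Sentinel-1",
                     producttype="GRD")
api.download_all(products)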
8.4 Deployment

Table 14 lists the components provided to the pilot as part of BDI (cf. https://github.com/big-data-europe/README/wiki/Components) and the components that will be developed within WP6 in the context of executing the pilot.

Table 14: Components needed to deploy the Second SC7 Pilot

- Big Data Integrator (HDFS/Hadoop, Cassandra, Spark, Semagrow, Strabon, SOLR): BDI dockers made available by WP4 (responsible: FH, TF, InfAI, NCSR-D, UoA, SWC)
- Cassandra and Strabon stores: the schema needs to be altered to support tweets by keyword (responsible: NCSR-D and UoA)
- Change detection module: Spark code to be developed for extending and improving the change detection algorithm (responsible: UoA)
- Event detection module: Spark code to be developed to scale the event detection algorithm (responsible: NCSR-D)
- Twitter data connector: to be extended to access the keyword search Twitter API (responsible: NCSR-D)
- User interface: to be enhanced for the pilot (responsible: UoA)
9 Conclusions

This report analysed the pilot requirements and specified the components of the generic Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting round. The relevant work in this task is to ensure that the components are within the scope of what is prepared in WP4, and that they interoperate and can be used in the same application.

All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure and provided to the piloting partners as a basis for their piloting applications, which will be developed in WP6. As a result of this preliminary testing and of the interaction between the technical partners and the piloting partners, some of the original pilot descriptions have been refined and fully specified, and their usage of BDI components has been clarified. This ensures that the pilot descriptions are consistent with the first public release of the BDI platform (D4.2) and can be reproduced by interested third parties.

Work in this task (Task 5.2) will proceed as follows:
- During the second pilot deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.
- During the third pilot deployment phase, work in this task will prepare the next version of this document, regarding the BDI instances needed for the third piloting round.
4 Second SC3 Pilot Deployment
41 Overview
The pilot is carried out by CRES in the frame of SC3 Secure Clean and Efficient Energy
The second pilot cycle extends the first pilot by adding additional online and offline data
analysis on raw data regarding Acoustic Emissions (AE) sensors and aggregated data such
as parametrics from continuous monitoring systems (CMS) The pilot demonstrates the
following workflow a developer in the field of wind energy enhances condition monitoring for
each unit in a wind farm by pooling together data from multiple units from the same farm (to
consider the cluster operation in total) and third party data (to perform correlated assessment)
The custom analysis modules created by the developer use both raw data that are transferred
offline to the processing cluster and condensed data streamed online at the same time order
that the event occurs
The following datasets are involved
Raw sensor and SCADA data from a given wind farm
Online stream data comprised of parametrics and statistics extracted from the raw
SCADA data
Raw sensor data from Acoustic Emissions module from a given wind farm
All data is in custom binary or ASCII formats ASCII files contain a metadata header and in
tabulated form the signal data (signal in columns time sequence in rows) All data is annotated
by location time and system id
The following processing is carried out
Near-real time execution of parametrized models to return operational statistics
warnings including correlation analysis of data across units
Weekly execution of operational statistics
Weekly execution of model parametrization
Weekly specific acoustic emissions DSP
The following outputs are made available for visualization or further processing
Operational statistics near-real time and weekly
Model parameters
D54 ndash v 100
Page
21
42 Requirements
Table 5 lists the ingestion storage processing and output requirements set by this pilot Since
the second cycle of the pilot extends the first pilot some requirements are identical and
therefore omitted from Table 5
Table 5 Requirements of Second SC3 Pilot
Requirement Comment
R1 The online data will be sent (via
OPC) from the intermediate
(local) processing level to BDI
A data connector must be developed that provides
for receiving OPC streams from an OPC-
compatible server
R2 The application should be able
to recover from short outages by
collecting the data transmitted
during the outage from the data
sources
An OPC data connector must be developed that
can retrieve the missing data collected at the
intermediate level from the distributed data
historian systems
R3 Near-realtime execution of
parametrized models to return
operational statistics including
correlation analysis of data
across units
The analysis software should write its results back
into a specified format and data model that is
appropriate input for further analysis
R4 The GUI supports database
querying and data visualization
for the analytics results
The GUI will be able to access files in the format
and data model
Table 5 Requirements of the Second SC3 Pilot
D54 ndash v 100
Page
22
Figure 3 Architecture of the Second SC3 Pilot
Figure 3 Architecture of the Second SC3 Pilot
43 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS that stores binary blobs each holding a temporal slice of the complete data The
slicing parameters are fixed and can be applied at data ingestion time
A Postgres relational database to store the warnings operational statistics and the
output of the analysis The schema will be defined in a later
A Kafka broker that will distribute the continuous stream of CMS to model execution
Processing infrastructures
D54 ndash v 100
Page
23
A processor that operates upon temporal slices of data
A Spark module that orchestrates the application of the processor on slices
A Spark streaming module that operates on the online data
Other modules
A data connector that offers an ingestion endpoint andor can retrieve from remote data
sources using the FTP protocol
A data connector that offers an ingestion endpoint that can retrieve an online stream
using OPC protocol and publish it to a Kafka topic
Data visualization that can visualize the data files stored in HDFS
44 Deployment
Table 6 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 6 Components needed to deploy the Second SC3 Pilot
Module Task Responsible
Spark HDFS Postgres
Kafka
BDI dockers made available by WP4 FH TF InfAI
NCSR-D SWC
Acoustic Emissions DSP To be developed for the pilot CRES
OPC Data connector To be developed for the pilot CRES
Data visualization To be extended for the pilot CRES
Table 6 Components needed to deploy the Second SC3 Pilot
D54 ndash v 100
Page
24
5 Second SC4 Pilot Deployment
51 Use cases
The pilot is carried out by FhG and CERTH in the frame of SC4 Smart Green and Integrated
Transport
The pilot demonstrates how to implement the workflow for ingesting processing and storing
stream and historical traffic data in a distributed environment The pilot demonstrates the
following workflows
The map matching of the Floating Car Data (FCD) stream that is generated by the taxi
fleet The FCD data that represents the position of cabs using latitude and longitude
coordinates must be map matched to the roads on which the cabs are driving in order
to infer the traffic conditions of the roads The map matching is done through an
algorithm using a geographical database and topological rules
The monitoring of the current traffic conditions that consumes the mapped FCD data
and infers the traffic conditions of the roads
The forecasting of future traffic conditions based on a model that is trained from
historical and real-time mapped FCD data
The second pilot is based upon the processing modules developed in the first pilot (cf D52
Section 5) namely the processing modules developed by CERTH to analyze traffic data
classify traffic conditions The second pilot will also develop the newly added workflow of the
traffic forecasting and model training that did not exist during the first pilot cycle
The data sources available for the pilot are
A near-real time stream Floating Car Data (FCD) generated by a fleet of 1200 taxis
containing information about the position speed and direction of the cabs
A historical database of recorded FCD data
A geographical database with information about the road network in Thessaloniki
The results of traffic monitoring and traffic forecasting are saved into a database for querying
statistics and visualizations
52 Requirements
Table 7 lists the ingestion storage processing and output requirements set by this pilot Since
the present pilot cycle is an extension of the first pilot the requirements of the first pilot also
apply Table 13 lists only the new requirements
D54 ndash v 100
Page
25
Table 7 Requirements of the Second SC4 Pilot
Requirement Comment
R1 The pilot will enable the
evaluation of the present and
future traffic conditions (eg
congestion) within temporal
windows
The FCD map matched data are used to determine
the current traffic condition and to make predictions
within different time windows
R2 The traffic predictions will be
saved in a database
Traffic condition and prediction will be used for
queries statistics evaluation of the quality of
predictions visualizations
R3 The pilot can be started in two
configurations single node (for
development and testing) and
cluster (production)
It must be possible to run all the pilot components
in one single node for development and testing
purposes The cluster configuration must provide
cluster of any components messaging system
(Kafka) processing modules (Flink Spark
TensorFlow) storage (Postgres)
Table 7 Requirements of the Second SC4 Pilot
D54 ndash v 100
Page
26
Figure 4 Architecture of the Second SC4 Pilot
Figure 4 Architecture of the Second SC4 Pilot
53 Architecture
The architecture of the pilot has been designed taking into consideration the data sources
mostly streams the processing steps needed and the information that needs to be computed
The pilot will ingest data from a near real-time FCD data stream from cabs and from historical
FCD data The FCD data needs to be preprocessed for map matching before being used for
classificationprediction
Apache Kafka will be used to distribute the computations as it provides a scalable fault
tolerant messaging system The processing of the data streams will be performed within
temporal windows Apache Flink will be used for the map matching algorithm in the same
manner as in the first cycle of the pilot Apache Spark or Tensorflow will be considered as a
platform to implement the traffic forecasting algorithm
The algorithms used for the map matching and classification will be provided using R as
it provides a good support for machine learning algorithms and because it is commonly used
D54 ndash v 100
Page
27
and well known by researchers at CERTH In order to use the R packages in a Flink application
developed in Java the pilot will connect to R server (via Rserve) Recurrent Neural Networks
will be used for the traffic forecasting module
The traffic conditions and prediction computation will be stored in a scalable fault tolerant
database such as Elasticsearch The storage system must support spatial and temporal
indexing
54 Deployment
Table 8 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 8 Components needed to deploy Second SC4 Pilot
Module Task Responsible
PostGIS Elasticsearch
Kafka Flink Spark
TensorFlow
BDI dockers made available by WP4 NCSR-D SWC
TF FhG
A Kafka producer for FCD
data stream (source URL)
and historical data (source
file system)
Develop a Kafka producer to collect
the FCD data as a stream from web
services and from the file system for
the historical data sets and send them
to a Kafka topic
FhG
Kafka brokers Install Kafka to provide a message
broker and the topics
SWC
A Spark application for traffic
forecasting and model
training
Develop a Spark application that
consumes FCD matched data from a
Kafka topic The application will train a
prediction model and write the traffic
predictions to ElasticSearch
FhG
D54 ndash v 100
Page
28
A Kafka consumer for storing
analysis results
Develop a Kafka consumer that stores
the result of the Traffic Classification
and prediction module
FhG
Table 8 Components needed to deploy the Second SC4 Pilot
6 Second SC5 Pilot Deployment
61 Use cases
The pilot is carried out by NCSR-D in the frame of SC5 Climate Action Environment Resource
Efficiency and Raw Materials
The pilot demonstrates the following workflow A (potentially hazardous) substance is released
in the atmosphere that results to increased readings in one or more monitoring stations The
user accesses a user interface provided by the pilot to define the locations of the monitoring
stations as well as a timeseries of the measured values (eg gamma dose rate) The platform
initiates
a weather matching algorithm that is a search for similarity of the current weather and
the pre-computed weather patterns as well as
a dispersion matching algorithm that is a search for similarity of the current substance
dispersion patterns with the precomputed ones
The weather patterns have been extracted in a pre-processing step by clustering weather
conditions recorded in the past while the substance dispersion patterns have been
precomputed by simulating different scenarios of substance release and weather conditions
The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request
The following datasets are involved
NetCDF files from the European Centre for Medium range Weather Forecasting
(ECMWF7)
GRIB files from National Oceanic and Atmospheric Administration (NOAA8)
The following processing will be carried out
The weather clustering algorithm that creates clusters of similar weather conditions
implemented using the BDI platform (see Section 63)
7 httpappsecmwfintdatasets 8 httpswwwncdcnoaagovdata-accessmodel-datamodel-datasetsglobal-forcast-system-gfs
D54 ndash v 100
Page
29
The WRF downscaling that takes as input a low resolution weather and creates a high
resolution weather
The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model
computes dispersion patterns given predominant weather conditions
The following outputs are made available for visualization or further processing
The dispersions produced by DIPCOT
The weather clusters produced by the weather clustering algorithm
62 Requirements
Table 9 lists the ingestion storage processing and output requirements set by this pilot
Table 9 Requirements of Second SC5 Pilot
Requirement Comment
R1 Provide a means of downloading
currentevaluation weather from
ECMWF or alternative services
Data connectorinterface needs to be developed
R2 ECMWF and NOAA datasets are
compatible with the WRF and
DIPCOT naming conventions
A preprocessing WPS normalization step will
perform the necessary transformations and
variable renamings needs to ensure compatibility
R3 Retrieve NetCDF files from HDFS
as input to the weather clustering
algorithm
R4 Dispersion matching will filter on
dispersion values
Relational database will provide indexes on
dispersion values for efficient dispersion search
R5 Dispersion visualization Weather and dispersion matching must produce
output compatible with Sextantrsquos input or Sextant
must be modified to support new input
Table 9 Requirements of the Second SC5 Pilot
D54 ndash v 100
Page
30
Figure 5 Architecture of the Second SC5 Pilot
Figure 5 Architecture of the Second SC5 Pilot
6.3 Architecture
To satisfy the requirements described above, the following components will be deployed:
Storage infrastructure:
- HDFS for storing NetCDF and GRIB files
- Postgres for storing dispersions
Processing components:
- scikit-learn or TensorFlow to host the weather clustering algorithm
Other modules:
- ECMWF and NOAA data connectors
- WPS normalization procedure
- WRF downscaling component
- DIPCOT atmospheric dispersion model
- Weather and dispersion matching
- Sextant for visualizing the dispersion layer
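As an illustration of how the weather clustering algorithm could be hosted on scikit-learn, the sketch below flattens archived weather fields into feature vectors and clusters them with k-means. The file names, NetCDF variable names and cluster count are hypothetical placeholders.

    # Minimal sketch: cluster archived weather fields (NetCDF) into
    # representative patterns with scikit-learn's KMeans. The variable
    # names ("u10", "v10") and the number of clusters are assumptions.
    import numpy as np
    from netCDF4 import Dataset
    from sklearn.cluster import KMeans

    def feature_vector(path):
        """Flatten one weather snapshot into a 1-D feature vector."""
        with Dataset(path) as nc:
            u = np.asarray(nc.variables["u10"][:])  # hypothetical wind components
            v = np.asarray(nc.variables["v10"][:])
        return np.concatenate([u.ravel(), v.ravel()])

    paths = ["weather_%03d.nc" % i for i in range(100)]  # hypothetical archive
    X = np.stack([feature_vector(p) for p in paths])

    kmeans = KMeans(n_clusters=8, random_state=0).fit(X)
    # kmeans.cluster_centers_ are the pre-computed weather patterns;
    # weather matching assigns the current weather to its nearest centre:
    current_pattern = kmeans.predict(X[:1])

This also suggests how weather matching can reuse the fitted model: the current weather field is vectorized in the same way and assigned to the nearest cluster centre.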
6.4 Deployment
Table 10 lists the components provided to the pilot as part of BDI and the components that will be developed within WP6 in the context of executing the pilot.
Table 10: Components needed to deploy the Second SC5 Pilot

Module | Task | Responsible
HDFS, Sextant, Postgres | BDI dockers made available by WP4 | TF, UoA, NCSR-D
scikit-learn / TensorFlow | To be developed in the pilot | NCSR-D
DIPCOT | To be packaged in the pilot | NCSR-D
Weather clustering algorithm | To be developed in the pilot | NCSR-D
Weather matching | To be developed in the pilot | NCSR-D
Dispersion matching | To be developed in the pilot | NCSR-D
ECMWF and NOAA data connector | To be developed in the pilot | NCSR-D
Data visualization UI | To be developed in the pilot | NCSR-D
7 Second SC6 Pilot Deployment
7.1 Use cases
The pilot is carried out by NCSR-D and SWC in the frame of SC6: Europe in a changing world – inclusive, innovative and reflective societies.
The pilot demonstrates the following workflow: municipality economic data (i.e., budget and budget execution data) are ingested on a regular basis (daily, weekly, and so on) from a series of locations in a variety of structures and formats, and are homogenized so that they can be compared, analyzed and visualized in a comprehensible way. The data is exposed to users via a dashboard that offers search/discovery, aggregation, analysis, correlation and visualization functionalities over structured data. The results of the data analysis will be stored in the infrastructure to avoid carrying out the same processing multiple times.
The second cycle of the pilot will extend the first pilot to incorporate different formats by developing a modular parsing library.
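As an illustration of what such a modular parsing library could look like, the sketch below registers one parser per source format behind a common dispatch function. The format names and record fields are hypothetical.

    # Minimal sketch of a modular parsing library: each source format
    # (CSV, XML, API payloads, ...) registers a parser that maps raw
    # input to homogenized records. Field names are illustrative only.
    import csv
    import io
    import xml.etree.ElementTree as ET

    PARSERS = {}

    def parser(fmt):
        """Decorator registering a parser for a given source format."""
        def register(fn):
            PARSERS[fmt] = fn
            return fn
        return register

    @parser("csv")
    def parse_csv(raw):
        for row in csv.DictReader(io.StringIO(raw)):
            yield {"code": row["code"], "amount": float(row["amount"])}

    @parser("xml")
    def parse_xml(raw):
        for entry in ET.fromstring(raw).iter("entry"):  # hypothetical element
            yield {"code": entry.get("code"), "amount": float(entry.get("amount"))}

    def homogenize(fmt, raw):
        """Dispatch raw budget data to the parser registered for its format."""
        return list(PARSERS[fmt](raw))

    print(homogenize("csv", "code,amount\n6054,1200.50\n"))

New source formats are then supported by registering one additional parser, without touching the rest of the ingestion pipeline.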
The following datasets are involved:
- Budget execution data of the Municipality of Athens
- Budget execution data of the Municipality of Thessaloniki
- Budget execution data of the Municipality of Barcelona
The current datasets involved are exposed either as an API or as CSV/XML files. Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies. Statistical data will be described in the RDF Data Cube12 vocabulary.
The following processing is carried out:
- Data ingestion and homogenization
- Aggregation, analysis and correlation over the budget data
The following outputs are made available for visualization or further processing:
- Structured information extracted from the budget datasets, exposed as a SPARQL endpoint
- Metadata for dataset searching and discovery
9 Cf. https://joinup.ec.europa.eu/asset/dcat_application_profile/description
10 Cf. http://www.omg.org/spec/EDMC-FIBO/FND/1.0/Beta1/index.htm
11 Cf. http://www.omg.org/hot-topics/finance.htm
12 Cf. https://www.w3.org/TR/2014/REC-vocab-data-cube-20140116/
- Aggregation and analysis results
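Since both the data and the analysis results are exposed through a SPARQL endpoint, the pre-defined aggregation queries can be run programmatically. The sketch below is a minimal example using the SPARQLWrapper library; the endpoint URL and the Data Cube dimension and measure URIs are hypothetical placeholders, not the pilot's actual vocabulary.

    # Minimal sketch: run one pre-defined aggregation query against the
    # pilot's SPARQL endpoint. The endpoint URL and property URIs are
    # hypothetical placeholders.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://localhost:8080/sparql")  # hypothetical
    sparql.setQuery("""
    PREFIX qb: <http://purl.org/linked-data/cube#>
    SELECT ?municipality (SUM(?amount) AS ?total)
    WHERE {
      ?obs a qb:Observation ;
           <http://example.org/dim/municipality> ?municipality ;  # hypothetical
           <http://example.org/measure/amount> ?amount .          # hypothetical
    }
    GROUP BY ?municipality
    """)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["municipality"]["value"], row["total"]["value"])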
7.2 Requirements
Table 11 lists the ingestion, storage, processing and output requirements set by this pilot.
Table 11: Requirements of the Second SC6 Pilot

R1. In case of failures during data processing, users should be able to recover data produced earlier in the workflow, along with their lineage and associated metadata.
    Comment: Processing modules should periodically store intermediate results. When starting up, processing modules should check at the metadata registry whether intermediate results are available.
R2. Transform budget data into a homogenized format using various parsers.
    Comment: Parsers will be developed for the pilot, taking into account R1.
R3. Expose data and metadata through a SPARQL endpoint.
    Comment: The triple store should be accessed via a SPARQL endpoint.
R4. Intuitive, easy-to-use interface for searching and selecting relevant data sources. The use of the user interface should be documented so that users can ease into using it with as little effort as possible.
    Comment: The GraphSearch UI will be used to create visualizations from SPARQL queries.
Figure 6: Architecture of the Second SC6 Pilot
7.3 Architecture
To satisfy the requirements above, the following modules will be deployed:
Storage infrastructures:
- HDFS for storing ingested datasets
- 4store for storing homogenized statistical data and dataset metadata
Processing infrastructures:
- Metadata extraction: Spark is used to extract RDF data and metadata from the budget data. These tools react on Kafka messages.
- PoolParty: a SKOS Thesaurus13 will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g., bio terms, locations and other named entities). For this step the SWC PoolParty Semantic Suite14 will be used as an external service. PoolParty is accessible from the BDE components via an HTTP API; the connection between Spark and PoolParty has been implemented in the first pilot cycle. Additional enrichment of the dataset will be explored, e.g., via linking to DBpedia or other LOD sources.
13 Please cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System
14 Please cf. http://www.poolparty.biz
- Data analysis that will be performed on demand by pre-defined queries in the dashboard
Other modules:
- Flume for dataset ingestion: for every source that will be ingested into the system there will be a Flume agent responsible for data ingestion and basic modification/unification
- Kafka: as soon as a new record is available, a Kafka message will be produced; one Kafka consumer stores the raw data into HDFS (a sketch follows this list)
- A set of pre-defined SPARQL queries that carry out analytical aggregations, important comparisons and/or other analysis of the data
- A GUI that provides functionality for (a) metadata searching, to discover datasets, data and publications, and (b) linked data browsing (i.e., dereferencing entity descriptions in RDF), in the form of a visual dashboard realised in d3.js15
- GraphSearch as the user interface
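The Kafka-to-HDFS path mentioned above can be illustrated with a minimal consumer sketch, assuming the kafka-python and hdfs client libraries; the topic, broker address and HDFS paths are hypothetical.

    # Minimal sketch: a Kafka consumer that writes each raw ingested record
    # into HDFS. Topic, broker and path names are hypothetical placeholders.
    from kafka import KafkaConsumer
    from hdfs import InsecureClient

    consumer = KafkaConsumer("budget-raw",              # hypothetical topic
                             bootstrap_servers="kafka:9092",
                             group_id="hdfs-sink")
    hdfs = InsecureClient("http://namenode:50070", user="bde")

    for message in consumer:
        # One file per Kafka offset keeps writes idempotent on replay.
        path = "/data/raw/%s-%d" % (message.topic, message.offset)
        hdfs.write(path, data=message.value, overwrite=True)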
7.4 Deployment
Table 12 lists the components provided to the pilot as part of BDI and the components that will be developed within WP6 in the context of executing the pilot.
Table 12: Components needed to deploy the Second SC6 Pilot

Module | Task | Responsible
Spark over HDFS, 4store, Flume, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, SWC
Data storage schema | To be extended for the pilot | SWC
Metadata extraction | Parsers for different data sources will be developed for the pilot | SWC
GraphSearch GUI | To be configured for the pilot | SWC

15 Cf. https://d3js.org/
8 Second SC7 Pilot Deployment
8.1 Use cases
The pilot is carried out by SatCen, UoA and NCSR-D in the frame of SC7: Secure societies – protecting freedom and security of Europe and its citizens.
The pilot demonstrates the following workflows:
1. Event detection workflow: news sites and social media are monitored and processed in order to extract and localize information about events. Events are categorized and the information from them is extracted; the end-user is notified about the area concerned by the news and can visualize the event information together with the changes detected by the other workflow (if activated).
2. Change detection workflow: the end user selects a relevant Area of Interest. With respect to the selected dates, two satellite images (earliest and latest) of these areas are downloaded from the ESA Sentinels Scientific Data Hub and processed in order to detect changes. The end-user is notified about detected changes and can view the images and event information about this area.
The second cycle of the SC7 pilot will extend the functionality and improve the performance of the first cycle of the pilot (cf. D5.2, Section 8).
Apart from the datasets used in the first cycle of the pilot, this cycle will also use the keyword-based Twitter API to retrieve tweets matching pre-defined keywords. To further support the keyword-based search, the second cycle of the pilot will also include a full-text indexing engine.
The following outputs are made available for visualization or further processing:
- Relevant news related to specific keywords, together with the corresponding Area of Interest
- Detected changes
Moreover, the event detection workflow will be extended in order to automatically activate the change detection workflow. These changes are depicted in the updated architecture diagram in Figure 7.
8.2 Requirements
Table 13 lists the ingestion, storage, processing and output requirements set by the second cycle of the pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 13 lists only the new requirements.
Table 13: Requirements of the Second SC7 Pilot

R1. Monitor keyword-based text services (Twitter). Text is retrieved and stored together with provenance and any metadata provided by the service (notably location).
    Comment: The NOMAD data connectors to Twitter and Reuters will be adapted to access the keyword search API of Twitter and store to Cassandra.
R2. Regularly execute event detection using Spark over the most recent text batch.
    Comment: Event detection is part of the ingestion process and adds annotations to the text data; it is not part of the distributed processing.
R3. Improve the speed of the change detection workflow.
    Comment: Optimize the scalability of the operators developed in Apache Spark for the change detection workflow.
R4. Extend the change detection workflow to improve accuracy.
    Comment: Fundamental SNAP operators (e.g., Subset and Terrain Correction) for Sentinel-1 will be adapted to Apache Spark.
R5. Areas of Interest are automatically defined by event detection.
    Comment: The Sentinel data connector is parametrized from the event detection module with a GIS shape.
R6. The end-user interface is based on Sextant.
    Comment: Improvement of Sextant functionalities to improve the user experience.
R7. Users must be authenticated and authorized to access the pilot data.
    Comment: Sextant will be extended in order to support authentication and authorization.
Figure 7: Architecture of the Second SC7 Pilot
8.3 Architecture
To satisfy the requirements above, the following modules will be deployed:
Storage infrastructures:
- HDFS for storing satellite images
- Cassandra for storing news and tweet content and metadata
- Lucene for storing the GADM dataset, i.e., the administrative areas together with their geo-locations
- Strabon for storing geo-locations of detected changes and location metadata about news and tweets
Processing infrastructures:
- Spark will be made available for improving the change detection module and developing the event detection module
Data integration:
- Semagrow will federate Strabon and Cassandra to provide the user interface with homogeneous access to both data stores
Other modules:
- Twitter data connector (a sketch follows this list)
- Reuters RSS feed reader
- The Sentinel Data Aggregator, which receives as input the set of areas of interest and submits a suitable query to the Sentinels Scientific Data Hub
- Sextant as the user interface
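As an illustration of the adapted Twitter data connector, the sketch below polls the Twitter v1.1 keyword search API and stores each tweet, with its provenance metadata, into Cassandra. The keyspace, table layout, credentials and keyword list are assumptions made for illustration.

    # Minimal sketch: poll Twitter's v1.1 keyword search API and store the
    # retrieved tweets, with provenance metadata, into Cassandra. Keyspace,
    # table and the bearer token are hypothetical placeholders.
    import requests
    from cassandra.cluster import Cluster

    SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"
    HEADERS = {"Authorization": "Bearer <token>"}  # hypothetical credentials

    session = Cluster(["cassandra"]).connect("sc7")  # hypothetical keyspace
    insert = session.prepare(
        "INSERT INTO tweets (id, keyword, text, created_at, geo) "
        "VALUES (?, ?, ?, ?, ?)")

    def ingest(keyword):
        """Fetch recent tweets for one pre-defined keyword and store them."""
        resp = requests.get(SEARCH_URL, headers=HEADERS,
                            params={"q": keyword, "count": 100})
        resp.raise_for_status()
        for tweet in resp.json()["statuses"]:
            session.execute(insert, (tweet["id"], keyword, tweet["text"],
                                     tweet["created_at"],
                                     str(tweet.get("coordinates"))))

    for kw in ["flood", "earthquake"]:  # hypothetical pre-defined keywords
        ingest(kw)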
8.4 Deployment
Table 14 lists the components provided to the pilot as part of BDI16 and the components that will be developed within WP6 in the context of executing the pilot.
Table 14: Components needed to deploy the Second SC7 Pilot

Module | Task | Responsible
Big Data Integrator: HDFS/Hadoop, Cassandra, Spark, Semagrow, Strabon, SOLR | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, UoA, SWC
Cassandra and Strabon stores | The schema needs to be altered to support tweets by keyword | NCSR-D and UoA
Change detection module | Spark code to be developed for extending and improving the change detection algorithm | UoA
Event detection module | Spark code to be developed to scale the event detection algorithm | NCSR-D
Twitter data connector | To be extended to access the keyword search Twitter API | NCSR-D
User interface | To be enhanced for the pilot | UoA

16 Cf. https://github.com/big-data-europe/README/wiki/Components
9 Conclusions
This report analysed the pilot requirements and specified the components of the generic Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting round. The relevant work in this task is to ensure that the components are within the scope of what is prepared in WP4, and that they interoperate and can be used in the same application.
All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure and provided to the piloting partners as a basis for their piloting applications, which will be developed in WP6. As a result of this preliminary testing and the interaction between the technical partners and the piloting partners, some of the original pilot descriptions have been refined and fully specified, and their usage of BDI components has been clarified. This ensures that the pilot descriptions are consistent with the first public release of the BDI platform (D4.2) and can be reproduced by interested third parties.
Work in this task (Task 5.2) will proceed as follows:
- During the second pilot deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.
- During the third pilot deployment phase, work in this task will prepare the next version of this document regarding the BDI instances needed for the third piloting round.
6 Second SC5 Pilot Deployment
61 Use cases
The pilot is carried out by NCSR-D in the frame of SC5 Climate Action Environment Resource
Efficiency and Raw Materials
The pilot demonstrates the following workflow A (potentially hazardous) substance is released
in the atmosphere that results to increased readings in one or more monitoring stations The
user accesses a user interface provided by the pilot to define the locations of the monitoring
stations as well as a timeseries of the measured values (eg gamma dose rate) The platform
initiates
a weather matching algorithm that is a search for similarity of the current weather and
the pre-computed weather patterns as well as
a dispersion matching algorithm that is a search for similarity of the current substance
dispersion patterns with the precomputed ones
The weather patterns have been extracted in a pre-processing step by clustering weather
conditions recorded in the past while the substance dispersion patterns have been
precomputed by simulating different scenarios of substance release and weather conditions
The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request
The following datasets are involved
NetCDF files from the European Centre for Medium range Weather Forecasting
(ECMWF7)
GRIB files from National Oceanic and Atmospheric Administration (NOAA8)
The following processing will be carried out
The weather clustering algorithm that creates clusters of similar weather conditions
implemented using the BDI platform (see Section 63)
7 httpappsecmwfintdatasets 8 httpswwwncdcnoaagovdata-accessmodel-datamodel-datasetsglobal-forcast-system-gfs
D54 ndash v 100
Page
29
The WRF downscaling that takes as input a low resolution weather and creates a high
resolution weather
The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model
computes dispersion patterns given predominant weather conditions
The following outputs are made available for visualization or further processing
The dispersions produced by DIPCOT
The weather clusters produced by the weather clustering algorithm
62 Requirements
Table 9 lists the ingestion storage processing and output requirements set by this pilot
Table 9 Requirements of Second SC5 Pilot
Requirement Comment
R1 Provide a means of downloading
currentevaluation weather from
ECMWF or alternative services
Data connectorinterface needs to be developed
R2 ECMWF and NOAA datasets are
compatible with the WRF and
DIPCOT naming conventions
A preprocessing WPS normalization step will
perform the necessary transformations and
variable renamings needs to ensure compatibility
R3 Retrieve NetCDF files from HDFS
as input to the weather clustering
algorithm
R4 Dispersion matching will filter on
dispersion values
Relational database will provide indexes on
dispersion values for efficient dispersion search
R5 Dispersion visualization Weather and dispersion matching must produce
output compatible with Sextantrsquos input or Sextant
must be modified to support new input
Table 9 Requirements of the Second SC5 Pilot
D54 ndash v 100
Page
30
Figure 5 Architecture of the Second SC5 Pilot
Figure 5 Architecture of the Second SC5 Pilot
63 Architecture
To satisfy the requirements described above the following components will be deployed
Storage infrastructure
HDFS for storing NetCDF and GRIB files
Postgres for storing dispersions
Processing components
Scilearn-kit or TensorFlow to host the weather clustering algorithm
Other modules
ECMWF and NOAA data connectors
WPS normalization procedure
WRF downscaling component
DIPCOT atmospheric dispersion model
Weather and dispersion matching
Sextant for visualizing the dispersion layer
D54 ndash v 100
Page
31
64 Deployment
Table 10 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 10 Components needed to deploy the Second SC5 Pilot
Module Task Responsible
HDFS Sextant Postgres BDI dockers made available by WP4 TF UoA NCSR-D
Scikit-learn TensorFlow To be developed in the pilot NCSR-D
DIPCOT To be packaged in the pilot NCSR-D
Weather clustering algorithm To be developed in the pilot NCSR-D
Weather matching To be developed in the pilot NCSR-D
Dispersion matching To be developed in the pilot NCSR-D
ECMWF and NOAA data
connector
To be developed in the pilot NCSR-D
Data visualization UI To be developed in the pilot NCSR-D
Table 10 Components needed to deploy the Second SC5 Pilot
D54 ndash v 100
Page
32
7 Second SC6 Pilot Deployment
71 Use cases
The pilot is carried out by NCSR-D and SWC in the frame of SC6 Europe in a changing world
- inclusive innovative and reflective societies
The pilot demonstrates the following workflow Municipality economic data (ie budget and
budget execution data) are ingested at a regular basis (daily weekly and so on) from a series
of locations in a variety of structures and formats are homogenized so that they can be
compared analyzed and visualized in a comprehensible way The data is exposed to users
via a dashboard that exposes searchdiscovery aggregation analysis correlation and
visualization functionalities over structured data The results of the data analysis will be stored
in the infrastructure to avoid carrying out the same processing multiple times
The second cycle of the pilot will extend the first pilot by incorporating different formats by
developing a modular parsing library
The following datasets are involved
Budget execution data of Municipality of Athens
Budget execution data of Municipality of Thessaloniki
Budget execution data of Municipality of Barcelona
The current datasets involved are exposed either as an API or as CSV XML files
Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies
Statistical data will be described in the RDF DataCube12 vocabulary
The following processing is carried out
Data ingestion and homogenization
Aggregation analysis correlation over scientific data
The following outputs are made available for visualization or further processing
Structured information extracted from budget datasets exposed as a SPARQL endpoint
Metadata for dataset searching and discovery
9 Cf httpsjoinupeceuropaeuassetdcat_application_profiledescription 10 Cf httpwwwomgorgspecEDMC-FIBOFND10Beta1indexhtm 11 Cf httpwwwomgorghot-topicsfinancehtm 12 Cf httpswwww3orgTR2014REC-vocab-data-cube-20140116
D54 ndash v 100
Page
33
Aggregation and analysis
72 Requirements
Table 11 lists the ingestion storage processing and output requirements set by this pilot
Table 11 Requirements of the Second SC6 Pilot
Requirement Comment
R1 In case of failures during data
processing users should be able to
recover data produced earlier in the
workflow along with their lineage and
associated metadata
Processing modules should periodically
store intermediate results When starting
up processing modules should check at the
metadata registry if intermediate results are
available
R2 Transform budget data into a
homogenized format using various
parsers
Parsers will be developed for the pilot
taking into account R1
R3 Expose data and metadata through a
SPARQL endpoint
The triple store should be accessed via a
SPARQL endpoint
R4 Intuitive easy-to-use interface for
searching and selecting relevant data
sources The use of the user interface
should be documented so that users
can ease into using it with as little
effort as possible
The GraphSearch UI will be used to create
visualizations from SPARQL queries
Table 11 Requirements of the Second SC6 Pilot
D54 ndash v 100
Page
34
Figure 6 Architecture of the Second SC6 Pilot
Figure 6 Architecture of the Second SC6 Pilot
73 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS for storing ingested datasets
4store for storing homogenized statistical data and dataset metadata
Processing infrastructures
Metadata extraction Spark is used to extract RDF data and metadata from budget
data These tools will react on Kafka messages
PoolParty A SKOS Thesaurus13 will be used to consolidatetranslate (linkmap) the
terms in the ingested documents (eg bio terms locations and other named entities)
For this step the SWC PoolParty Semantic Suite14 will be used as an external service
13 Please cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 14 Please cf httpwwwpoolpartybiz
D54 ndash v 100
Page
35
PoolParty is accessible from the BDE components via an HTTP API The connection
between Spark and PoolParty has been implemented in the first pilot cycle Additional
enrichment of the dataset will be explored eg via linking to DBpedia or other LOD
sources
Data analysis that will be performed on demand by pre-defined queries in the
dashboard
Other modules
Flume for dataset ingestion For every source that will be ingested into the system there
will be a flume agent responsible for data ingestion and basic modificationunification
Kafka as soon as a new record is available a Kafka message will be produced One
kafka consumer stores raw data into HDFS
A set of pre-defined SPARQL queries that carry out analytical aggregations important
comparisons and or other analysis of the data
GUI that provide functionality for (a) metadata searching to discover datasets data and
publications (b) linked data browsing (ie dereferencing entity descriptions in RDF) in
the form of a visual dashboard realised in d3js15
GraphSearch as the user interface
74 Deployment
Table 12 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 12 Components needed to deploy the Second SC6 Pilot
Module Task Responsible
Spark over HDFS 4store
Flume Kafka
BDI dockers made available by WP4 FH TF InfAI
NCSR-D SWC
Data storage schema To be extended for the pilot SWC
Metadata extraction Parsers for different data sources will
be developed for the pilot
SWC
15 Cf httpsd3jsorg
D54 ndash v 100
Page
36
GraphSearch GUI To be configured for the pilot SWC
Table 12 Components needed to deploy the Second SC6 Pilot
8 Second SC7 Pilot Deployment
81 Use cases
The pilot is carried out by SatCen UoA and NCSR-D in the frame of SC7 Secure societies ndash
Protecting freedom and security of Europe and its citizens
The pilot demonstrates the following workflows
1 Event detection workflow News sites and social media are monitored and processed
in order to extract and localize information about events Events are categorized and
the information from them is extracted the end-user is notified about the area interested
by the news and can visualize the events information together with the changes
detected by the other workflow (if activated)
2 Change detection workflow The end user selects a relevant Area of Interest With
respect to the selected dates two satellite images (earliest and latest) of these areas
are downloaded from ESA Sentinels Scientific Data Hub and processed in order to
detect changes The end-user is notified about detected changes and can view the
images and event information about this area
The second cycle of the SC7 pilot will extend the functionality and improve the performance of
the first cycle of the pilot (cf D52 Section 8)
Apart from the datasets used in the first cycle of the pilot this cycle will also use the keyword-
based Twitter API to retrieve tweets based on pre-defined keywords To further support the
keyword-based search the second cycle of the pilot will also include a full-text indexing engine
The following outputs are made available for visualization or further processing
Relevant news related to specific keywords together with the corresponding Area of
Interested
Detected changes
Moreover the event detection workflow will be extended in order to automatically activate the
change detection workflow These changes are depicted in the updated architecture diagram
in Figure 7
D54 ndash v 100
Page
37
82 Requirements
Table 13 lists the ingestion storage processing and output requirements set by the second
cycle of the pilot Since the present pilot cycle is an extension of the first pilot the requirements
of the first pilot also apply Table 13 lists only the new requirements
Table 13 Requirements of the Second SC7 Pilot
Requirement Comment
R1 Monitor keyword-based text services
(Twitter) Text is retrieved and stored
together with provenance and any
metadata provided by the service
(notably location)
The NOMAD data connectors to Twitter
and Reuters will be adapted to access the
keyword search API of Twitter and store to
Cassandra
R2 Regularly execute event detection
using Spark over the most recent text
batch
Event detection is part of the ingestion
process and adds annotations to the text
data not part of the distributed processing
R3 Improve the speed of the change
detection workflow
Optimize the scalability of the operators
developed in Apache Spark for the change
detection workflow
R4 Extend change detection workflow to
improve accuracy
Fundamental SNAP operators (eg Subset
and Terrain Correction) for Sentinel 1 will be
adapted to Apache Spark
R5 Areas of Interest are automatically
defined by event detection
The Sentinel data connector is
parametrized from the event detection
module with a GIS shape
R6 End-user interface is based on Sextant Improvement of Sextant functionalities to
improve the user experience
D54 ndash v 100
Page
38
R7 Users must be authenticated and
authorized to access the pilot data
Sextant will be extended in order to support
authentication and authorization
Table 13 Requirements of the Second SC7 Pilot
Figure 7 Architecture of the Second SC7 Pilot
Figure 7 Architecture of the Second SC7 Pilot
83 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS for storing satellite images
Cassandra for storing news and tweets content and metadata
Lucene for storing GADM dataset ie the administrative areas together with their geo-
locations
D54 ndash v 100
Page
39
Strabon for storing geo-locations of detected changes and location metadata about
news and tweets
Processing infrastructures
Spark will be made available for improving the change detection module and
developing the event detection module
Data integration
Semagrow will federate Strabon and Cassandra to provide the user interface with
homogeneous access to both data stores
Other modules
Twitter data connector
Reuters RSS feed reader
The Sentinel Data Aggregator receives as input the set of areas of interest and submits
a suitable query to the Sentinels Scientific Data Hub
Sextant as the user interface
84 Deployment
Table 14 lists the components provided to the pilot as part of BDI16 and components that will
be developed within WP6 in the context of executing the pilot
Table 14 Components needed to deploy the Second SC7 Pilot
Module Task Responsible
Big Data Integrator
HDFSHadoop Cassandra
Spark Semagrow Strabon
SOLR
BDI dockers made available by WP4 FH TF InfAI
NCSR-D UoA
SwC
Cassandra and Strabon
stores
The schema needs to be altered to
support tweets by keyword
NCSR-D and
UoA
Change detection module Spark code to be developed for UoA
16 Cf httpsgithubcombig-data-europeREADMEwikiComponents
D54 ndash v 100
Page
40
extending and improving the change
detection algorithm
Event Detection module Spark code to be developed to scale
the event detection algorithm
NCSR-D
Twitter data connector To be extended to access the keyword
search Twitter API
NCSR-D
User interface To be enhanced for the pilot UoA
Table 14 Components needed to deploy the Second SC7 Pilot
D54 ndash v 100
Page
41
9 Conclusions

This report analysed the pilot requirements and specified the components of the generic Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting round. The relevant work in this task is to ensure that the components are within the scope of what is prepared in WP4, and that they interoperate and can be used in the same application.

All seven BDI instantiations have been deployed and tested on the NCSR-D infrastructure and provided to the piloting partners as a basis for their piloting applications, which will be developed in WP6. As a result of this preliminary testing and of the interaction between the technical and the piloting partners, some of the original pilot descriptions have been refined and fully specified, and their usage of BDI components has been clarified. This ensures that the pilot descriptions are consistent with the first public release of the BDI platform (D4.2) and can be reproduced by interested third parties.

Work in this task (Task 5.2) will proceed as follows:
- During the second pilot deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.
- During the third pilot deployment phase, work in this task will prepare the next version of this document, covering the BDI instances needed for the third piloting round.
D54 ndash v 100
Page
8
List of Figures
Figure 1 Architecture of the Second SC1 Pilot 12
Figure 2 Architecture of the Second SC2 Pilot 17
Figure 3 Architecture of the Second SC3 Pilot 22
Figure 4 Architecture of the Second SC4 Pilot 26
Figure 5 Architecture of the Second SC5 Pilot 30
Figure 6 Architecture of the Second SC6 Pilot 34
Figure 7 Architecture of the Second SC7 Pilot 38
D54 ndash v 100
Page
9
1 Introduction
11 Purpose and Scope
This report documents the instantiations of the Big Data Integrator Platform (BDI) for serving
the needs of the domains examined within Big Data Europe These platform instances will be
provided to the relevant networking partners to execute the pilots foreseen in WP6
12 Methodology
Task 52 focuses on the application of the generic Instantiation methodology in a specific Use
Case pertaining to domains closely related to Europersquos Social challenges To this end T52
comprises seven (7) distinct sub-tasks each one dedicated to a different domain of application
Participating partners and their role NCSR-D (task leader) deploys the different instantiations
of the Big Data Integrator Platform and supports the partners carrying out each pilot with
consulting about the platform This task includes two phases the design and the deployment
phase The design phase involves the following
Review the pilot descriptions prepared in WP6 and request clarifications where needed
in order to prepare a detailed technical description of the platform that will support the
pilot
Prepare a first draft of the sections for the second cycle pilots where use cases and
workflow from the pilot descriptions are summarized and technical requirements and
an architecture for each pilot-specific platform is drafted
Cooperate with the persons responsible for each pilot to update the pilot description
and the technical description in this deliverable so that they are consistent and
satisfactory This draft also includes a list of components and their availability (a) base
platform components that are prepared in WP4 (b) pilot-specific components that are
already available or (c) pilot-specific components that will be developed for the pilot
Components are also assigned a partner responsible for their implementation
Review the pilot technical descriptions from the perspective of bridging between
technical work and the community requirements to establish that the pilot is relevant
to the communities it is aimed at
During deployment phase work in this task will follow and document development of the
individual components and test their integration into the platform
D54 ndash v 100
Page
10
2 Second SC1 Pilot Deployment
21 Use Cases
The pilot is carried out by OPF and VU in the frame of SC1 Health Demographic Change and
Wellbeing
The pilot demonstrates the workflow of reproducing the functionality of an existing data
integration and processing system (the Open PHACTS Discovery Platform) on BDI The
second pilot extends the first pilot (cf D52 Section 2) with the following
Discussions with stakeholders and other Societal Challenges will identify how the
existing Open PHACTS platform and datasets may potentially be used to answer
queries in other domains In particular applications in Societal Challenge 2 (food
security and sustainable agriculture) where the effects of chemistry (eg pesticides)
on biology are probed in plants could exploit the linked data services currently within
the OPF platform This will require discussing use case specifics with SC2 to
understand their requirements and ensure that the OPF data is applicable Similarly
we will explore whether SC2 data could be linked to the OPF data platform is relevant
for early biology research
No specific new datasets are targeted for integration in the second pilot However if
datasets to be made available through other pilots have clear potential links to Open
PHACTS datasets these will be considered for integration into the platform to offer
researchers the ability to pose more complex queries across a wider range of data
The second pilot will aim to expand on first pilot by refreshing the datasets integrated
into the pilot Homogenising and integrating the new data available for these datasets
and developing ways to update datasets by integrating new data on an ongoing basis
will enable new use cases where researchers require fully current datasets for their
queries
The second pilot will also simplify existing workflows for querying the API for example
with components for common software tools such as KNIME reducing the barrier for
academic institutions and companies to access the platform for knowledge- and data-
driven biomedical research use cases
22 Requirements
Table 1 lists the ingestion storage processing and output requirements set by this pilot
Table 1 Requirements of the Second SC1 Pilot
D54 ndash v 100
Page
11
Requirement Comment
R1 The solution should be
packaged in a way such that it is
possible to combine the Open
PHACTS Docker and the BDE
platform to achieve a custom
integrated solution
Specificities of the services of the Open PHACTS
Discovery Platform should not be hard-wired into
the domain-specific instance but should be read
from a configuration file (such as SWAGGER)
The BDE instance should offer or apply these
external services over data hosted by the BDE
instance
R2 RDF data storage The current Open PHACTS Discovery Platform is
based on distributed Virtuoso a proprietary
solution The BDE platform will provide a
distributed 4store and SANSA to be compared
with the Open PHACTS Discovery Platform
R3 Datasets are aligned and linked
at data ingestion time and the
transformed data is stored
In conjunction with R1 a modular data ingestion
component should dynamically decide which data
transformers to invoke
R4 Data and query security and
privacy requirements
A BDI local deployment holds private data and
serves private queries BDE does not foresee any
specific technical support for query obfuscation
so remote data sources need to be cloned locally
to guarantee query privacy
Table 1 Requirements of the Second SC1 Pilot
D54 ndash v 100
Page
12
Figure 1 Architecture of the Second SC1 Pilot
Figure 1 Architecture of the Second SC1 pilot
23 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
Distributed triple store for the data The second pilot cycle will also test the feasibility of
using SANSA stack1 as an alternative of SPARQL query processing
Processing infrastructures
Scientific Lenses query expansion
Other modules
Data connector including the data transformation modules for the alignment of data at
ingestion time
REST API for querying that builds a SPARQL query by using keywords to fill in pre-
defined query templates The querying services also uses Scientific Lenses to expand
queries
24 Deployment
Table 2 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
1 httpsansa-stacknet
D54 ndash v 100
Page
13
Table 2 Components needed to Deploy Second SC1 Pilot
Module Task Responsible
4store BDI dockers made available by WP4 NCSR-D
SANSA stack BDI dockers made available by WP4 FhGUniBonn
Data connector and
transformation modules
Develop a dynamic transformation
engine that uses SWAGGER
descriptions to select the appropriate
transformer
VU
Query endpoint Develop a dynamic query re-write
engine that uses SWAGGER
descriptions to select the transformer
VU
Scientific Lenses query
expansion module
Needs to be deployed and tested
unless an existing live service will be
used for the BDE pilot
VU
Table 2 Components needed to Deploy Second SC1 Pilot
D54 ndash v 100
Page
14
3 Second SC2 Pilot Deployment
31 Overview
The pilot is carried out by AK FAO and SWC in the frame of SC2 Food Security Sustainable
Agriculture and Forestry Marine Maritime and Inland Water Research and the Bioeconomy
The second pilot cycle builds upon the first pilot cycle (cf D51 Section 3) expanding the
relevant data sources and extending the data processing needed to handle a variety of data
types (apart from bibliographic data) relevant to Viticulture
The pilot demonstrates the following workflows
1 Text mining workflow Automatically annotating scientific publications by (a) extracting
named entities (locations domain terms) and (b) extracting the captions of images
figures and tables The extracted information is provided to viticultural researchers via
a GUI that exposes search functionality
2 Data processing workflow The end users (viticultural researchers) upload scientific
data in a variety of formats and provide the metadata needed in order to correctly
interpret the data The data is ingested and homogenized so that it can be compared
and connected with other relevant data originally in diverse formats The data is
exposed to viticultural researchers via a GUI that exposes searchdiscovery
aggregation analysis correlation and visualization functionalities over structured data
The results of the data analysis will be stored in the infrastructure to avoid carrying out
the same processing multiple times with appropriate provence for future reference
publication and scientific replication
3 Phenologic modeling workflow that is the scheduling of agricultural operations (eg
pruning harvesting etc) by cross-examining the weather data observed in the area of
the vineyard with the appropriate weather conditions needed for the aforementioned
operations
4 Variety identification workflow The end users complete an on-spot questionnaire
regarding the characteristics of a specific grape variety Together with the geolocation
of the questionnaire this information is used to identify a grape variety
The following datasets will be involved
The AGRIS and PubMed datasets that include scientific publications
Weather data available via publicly-available API such as AccuWeather
OpenWeatherMap Weather Underground
D54 ndash v 100
Page
15
User-generated data such as geotagged photos from leaves young shoots and grape
clusters ampelographic data SSR-marker data that will be provided by the VITIS
application
OIV Descriptor List2 for Grape Varieties and Vitis species
Crop Ontology
The following processing is carried out
Named entity extraction
Researcher affiliation extraction and verification
Variety identification
Phenologic modelling
PDF structure processing to associate tables and diagrams with captions
Data ingestion and homogenization
Aggregation analysis correlation over scientific data
The following outputs are made available for visualization or further processing
Structured information topics extracted from scientific publications
Metadata for dataset searching and discovery
Aggregation analysis correlation results
32 Requirements
Table 3 lists the ingestion storage processing and output requirements set by this pilot
Table 3 Requirements of the Second SC2 Pilot
Requirement Comment
R1 In case of failures during data
processing users should be able to
recover data produced earlier in the
workflow along with their lineage and
associated metadata
Processing modules should periodically
store intermediate results and their lineage
metadata When starting up processing
modules should check at the metadata
registry if intermediate results are available
R2 Extracting images and their captions
from scientific publications
To be developed for the pilot taking into
account R1
2 httpwwwoivinten
D54 ndash v 100
Page
16
R3 Extracting thematic annotations from
text in scientific publications.
To be developed for the pilot, taking into
account R1.
R4 Extracting researcher affiliations from
the scientific publications.
To be developed for the pilot, taking into
account R1.
R5 Variety identification. To be developed for the pilot, taking into
account R1.
R6 Phenologic modelling. To be developed for the pilot, taking into
account R1.
R7 Expose data and metadata in JSON
through a Web API.
The data ingestion module should write JSON
documents in HDFS; 4store should be
accessed via a SPARQL endpoint that
responds with results in JSON.
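The recovery behaviour required by R1 can be illustrated with a short sketch. This is a minimal example, assuming a JSON metadata registry kept in HDFS and the Python hdfs package; the registry layout, paths and stage names are illustrative, not part of the pilot specification.

```python
# Minimal sketch of the R1 recovery pattern: before recomputing a
# workflow stage, check a metadata registry for a stored intermediate
# result and its lineage metadata. Registry layout and paths are
# hypothetical.
import json
from hdfs import InsecureClient

hdfs = InsecureClient("http://namenode:50070", user="pilot")
REGISTRY = "/metadata/registry.json"

def load_or_compute(stage, compute):
    with hdfs.read(REGISTRY, encoding="utf-8") as reader:
        registry = json.load(reader)
    entry = registry.get(stage)
    if entry:
        # An intermediate result is available; reuse it instead of recomputing.
        with hdfs.read(entry["path"], encoding="utf-8") as reader:
            return json.load(reader)
    result = compute()
    path = "/intermediate/{}.json".format(stage)
    hdfs.write(path, data=json.dumps(result), encoding="utf-8", overwrite=True)
    registry[stage] = {"path": path, "lineage": {"stage": stage, "inputs": []}}
    hdfs.write(REGISTRY, data=json.dumps(registry), encoding="utf-8", overwrite=True)
    return result
```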
Figure 2: Architecture of the Second SC2 Pilot
3.3 Architecture
To satisfy the requirements above, the following modules will be deployed.
Storage infrastructures:
- HDFS for storing publication full-text and ingested datasets.
- A graph database for storing publication metadata (terms and named entities),
affiliation metadata (connections between researchers), weather metadata and VITIS
metadata.
Processing infrastructures:
- Metadata extraction: Spark or UnifiedViews3 are used to extract RDF metadata from
publication full-text. These tools will react to Kafka messages; Spark and UnifiedViews
will be evaluated for this task.
3 Cf. http://www.unifiedviews.eu
- PoolParty: a SKOS Thesaurus4 will be used to consolidate/translate (link/map) the
terms in the ingested documents (e.g. bio terms, locations and other named entities).
For this step the SWC PoolParty Semantic Suite5 will be used. Additional enrichment
of the dataset will be explored, e.g. via linking to DBpedia or other LOD sources.
- AKSTEM: the process of discovering relations and associations between organizations
and people in the field of viticulture research.
- Phenologic modelling: an algorithm already developed in AK VITIS will be adapted to
work in the context of an Apache Spark application.
- Variety identification: already developed in AK VITIS, it will be adapted to work in the
context of an Apache Spark application.
- Extraction of images and figures and their captions from publication PDFs.
- Data analysis, which writes analysis results back into the infrastructure to be retrieved
for visualization. Data analysis should accompany each write-back with appropriate
metadata that specifies the processing lineage of the derived dataset. Intermediate
results should also be written out (and described as such in the metadata) in order to
allow resuming processing after a failure.
Other modules:
- Flume for publication ingestion: for every source that will be ingested into the system
there will be a Flume agent responsible for data ingestion and basic
modification/unification.
- Kafka: as soon as a new record is available, a Kafka message will be produced. One
Kafka consumer stores raw data into HDFS.
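The last of these modules can be sketched concretely. The following is a minimal example of the Kafka consumer that stores raw records into HDFS, assuming the kafka-python and hdfs packages; the topic name, broker address and HDFS layout are illustrative.

```python
# Minimal sketch of the Kafka consumer that persists raw ingested
# records into HDFS. Topic, broker and paths are hypothetical.
import json
from kafka import KafkaConsumer
from hdfs import InsecureClient

consumer = KafkaConsumer(
    "publications",                       # hypothetical ingestion topic
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
hdfs = InsecureClient("http://namenode:50070", user="flume")

for message in consumer:
    record = message.value
    # One JSON document per record, keyed by a source-provided identifier.
    path = "/raw/publications/{}.json".format(record["id"])
    hdfs.write(path, data=json.dumps(record), encoding="utf-8", overwrite=True)
```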
3.4 Deployment
Table 4 lists the components provided to the pilot as part of BDI, and components that will be
developed within WP6 in the context of executing the pilot.
Table 4 Components needed to deploy the Second SC2 Pilot
Module Task Responsible
Spark over HDFS Flume
Kafka
BDI dockers made available by WP4 FH TF InfAI
SWC
4 Cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System 5 Cf. http://www.poolparty.biz
GraphDB and/or Neo4j
dockerization
To be investigated whether the Docker
images provided by the official
systems6 are suitable for the pilot. If
not, they will be adapted for the pilot, or
an already dockerized triple store such
as Virtuoso or 4store will be used.
SWC
Flume agents for publication
ingestion and processing
To be developed for the pilot SWC
Flume agents for data
ingestion
To be extended for the pilot in order to
support the introduced datasets
(accuweather data user-generated
data)
SWC AK
Data storage schema To be developed for the pilot SWC AK
Phenologic modelling To be adapted from AK VITIS for the
pilot
AK
Spark AKSTEM To be adapted from AK STEM for the
pilot
AK
Variety Identification To be adapted from AK VITIS for the
pilot
AK
6 https://neo4j.com/developer/docker
4 Second SC3 Pilot Deployment
4.1 Overview
The pilot is carried out by CRES in the frame of SC3: Secure, Clean and Efficient Energy.
The second pilot cycle extends the first pilot by adding additional online and offline data
analysis on raw data from Acoustic Emissions (AE) sensors and on aggregated data, such
as parametrics from continuous monitoring systems (CMS). The pilot demonstrates the
following workflow: a developer in the field of wind energy enhances condition monitoring for
each unit in a wind farm by pooling together data from multiple units of the same farm (to
consider the cluster operation in total) and third-party data (to perform correlated assessment).
The custom analysis modules created by the developer use both raw data that is transferred
offline to the processing cluster and condensed data streamed online, in the same time order
in which the events occur.
The following datasets are involved:
- Raw sensor and SCADA data from a given wind farm.
- Online stream data comprised of parametrics and statistics extracted from the raw
SCADA data.
- Raw sensor data from the Acoustic Emissions module of a given wind farm.
All data is in custom binary or ASCII formats. ASCII files contain a metadata header and, in
tabulated form, the signal data (signals in columns, time sequence in rows). All data is
annotated by location, time and system id.
The following processing is carried out:
- Near-real-time execution of parametrized models to return operational statistics and
warnings, including correlation analysis of data across units.
- Weekly execution of operational statistics.
- Weekly execution of model parametrization.
- Weekly specific acoustic emissions DSP.
The following outputs are made available for visualization or further processing:
- Operational statistics, near-real-time and weekly.
- Model parameters.
4.2 Requirements
Table 5 lists the ingestion, storage, processing and output requirements set by this pilot. Since
the second cycle of the pilot extends the first pilot, some requirements are identical and are
therefore omitted from Table 5.
Table 5 Requirements of Second SC3 Pilot
Requirement Comment
R1 The online data will be sent (via
OPC) from the intermediate
(local) processing level to BDI
A data connector must be developed that provides
for receiving OPC streams from an OPC-
compatible server
R2 The application should be able
to recover from short outages by
collecting the data transmitted
during the outage from the data
sources
An OPC data connector must be developed that
can retrieve the missing data collected at the
intermediate level from the distributed data
historian systems
R3 Near-realtime execution of
parametrized models to return
operational statistics including
correlation analysis of data
across units
The analysis software should write its results back
into a specified format and data model that is
appropriate input for further analysis
R4 The GUI supports database
querying and data visualization
for the analytics results.
The GUI will be able to access files in the specified
format and data model.
Figure 3: Architecture of the Second SC3 Pilot
4.3 Architecture
To satisfy the requirements above, the following modules will be deployed.
Storage infrastructures:
- HDFS, which stores binary blobs, each holding a temporal slice of the complete data.
The slicing parameters are fixed and can be applied at data ingestion time.
- A Postgres relational database to store the warnings, operational statistics and the
output of the analysis. The schema will be defined at a later stage.
- A Kafka broker that will distribute the continuous stream of CMS data to model execution.
Processing infrastructures:
- A processor that operates upon temporal slices of data.
- A Spark module that orchestrates the application of the processor on slices.
- A Spark streaming module that operates on the online data.
Other modules:
- A data connector that offers an ingestion endpoint and/or can retrieve from remote data
sources using the FTP protocol.
- A data connector that offers an ingestion endpoint that can retrieve an online stream
using the OPC protocol and publish it to a Kafka topic.
- Data visualization that can visualize the data files stored in HDFS.
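As an illustration of how the Spark module can orchestrate the per-slice processor, the following is a minimal sketch, assuming the raw data has already been ingested into HDFS as one binary blob per temporal slice; the paths and the processing routine are hypothetical placeholders.

```python
# Minimal sketch of orchestrating a per-slice processor with Spark.
# One binary blob per temporal slice is assumed; process_slice stands
# in for the pilot's actual DSP/statistics code.
from pyspark import SparkContext

def process_slice(path_and_blob):
    path, blob = path_and_blob
    # Placeholder: parse the binary slice and compute operational statistics.
    return (path, len(blob))

sc = SparkContext(appName="sc3-slice-processing")
slices = sc.binaryFiles("hdfs://namenode/sc3/slices/*")  # one blob per slice
stats = slices.map(process_slice).collect()
```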
4.4 Deployment
Table 6 lists the components provided to the pilot as part of BDI, and components that will be
developed within WP6 in the context of executing the pilot.
Table 6 Components needed to deploy the Second SC3 Pilot
Module Task Responsible
Spark HDFS Postgres
Kafka
BDI dockers made available by WP4 FH TF InfAI
NCSR-D SWC
Acoustic Emissions DSP To be developed for the pilot CRES
OPC Data connector To be developed for the pilot CRES
Data visualization To be extended for the pilot CRES
5 Second SC4 Pilot Deployment
5.1 Use cases
The pilot is carried out by FhG and CERTH in the frame of SC4: Smart, Green and Integrated
Transport.
The pilot demonstrates how to implement the workflow for ingesting, processing and storing
stream and historical traffic data in a distributed environment. The pilot demonstrates the
following workflows:
- The map matching of the Floating Car Data (FCD) stream that is generated by the taxi
fleet. The FCD data, which represents the position of cabs using latitude and longitude
coordinates, must be map matched to the roads on which the cabs are driving in order
to infer the traffic conditions of the roads. The map matching is done through an
algorithm using a geographical database and topological rules.
- The monitoring of the current traffic conditions, which consumes the mapped FCD data
and infers the traffic conditions of the roads.
- The forecasting of future traffic conditions, based on a model that is trained from
historical and real-time mapped FCD data.
The second pilot is based upon the processing modules developed in the first pilot (cf. D5.2,
Section 5), namely the processing modules developed by CERTH to analyze traffic data and
classify traffic conditions. The second pilot will also develop the newly added workflow of
traffic forecasting and model training that did not exist during the first pilot cycle.
The data sources available for the pilot are:
- A near-real-time stream of Floating Car Data (FCD) generated by a fleet of 1200 taxis,
containing information about the position, speed and direction of the cabs.
- A historical database of recorded FCD data.
- A geographical database with information about the road network in Thessaloniki.
The results of traffic monitoring and traffic forecasting are saved into a database for querying,
statistics and visualizations.
5.2 Requirements
Table 7 lists the ingestion, storage, processing and output requirements set by this pilot. Since
the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also
apply; Table 7 lists only the new requirements.
D54 ndash v 100
Page
25
Table 7 Requirements of the Second SC4 Pilot
Requirement Comment
R1 The pilot will enable the
evaluation of the present and
future traffic conditions (eg
congestion) within temporal
windows
The FCD map matched data are used to determine
the current traffic condition and to make predictions
within different time windows
R2 The traffic predictions will be
saved in a database
Traffic condition and prediction will be used for
queries statistics evaluation of the quality of
predictions visualizations
R3 The pilot can be started in two
configurations: single node (for
development and testing) and
cluster (production).
It must be possible to run all the pilot components
on one single node, for development and testing
purposes. The cluster configuration must provide
clustering of all components: messaging system
(Kafka), processing modules (Flink, Spark,
TensorFlow), storage (Postgres).
Figure 4: Architecture of the Second SC4 Pilot
5.3 Architecture
The architecture of the pilot has been designed taking into consideration the data sources
(mostly streams), the processing steps needed, and the information that needs to be computed.
The pilot will ingest data from a near-real-time FCD data stream from cabs and from historical
FCD data. The FCD data needs to be preprocessed for map matching before being used for
classification/prediction.
Apache Kafka will be used to distribute the computations, as it provides a scalable, fault-
tolerant messaging system. The processing of the data streams will be performed within
temporal windows. Apache Flink will be used for the map matching algorithm, in the same
manner as in the first cycle of the pilot. Apache Spark or TensorFlow will be considered as a
platform to implement the traffic forecasting algorithm.
The algorithms used for the map matching and classification will be provided using R, as
it provides good support for machine learning algorithms and because it is commonly used
and well known by researchers at CERTH. In order to use the R packages in a Flink application
developed in Java, the pilot will connect to an R server (via Rserve). Recurrent Neural Networks
will be used for the traffic forecasting module.
The computed traffic conditions and predictions will be stored in a scalable, fault-tolerant
database such as Elasticsearch. The storage system must support spatial and temporal
indexing.
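The write path from the forecasting module to the store can be sketched as follows. This is a minimal example, assuming the kafka-python and elasticsearch packages; the topic, index and field names are illustrative, and the model itself is stubbed out.

```python
# Minimal sketch: consume map-matched FCD records from Kafka and
# index forecasts into Elasticsearch. All names are hypothetical.
import json
from kafka import KafkaConsumer
from elasticsearch import Elasticsearch

def forecast(record):
    # Placeholder for the trained forecasting model.
    return {"segment": record["segment"], "predicted_speed": record["speed"]}

consumer = KafkaConsumer(
    "fcd-mapmatched",
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
es = Elasticsearch(["http://elasticsearch:9200"])

for message in consumer:
    es.index(index="traffic-predictions", body=forecast(message.value))
```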
5.4 Deployment
Table 8 lists the components provided to the pilot as part of BDI, and components that will be
developed within WP6 in the context of executing the pilot.
Table 8 Components needed to deploy Second SC4 Pilot
Module Task Responsible
PostGIS Elasticsearch
Kafka Flink Spark
TensorFlow
BDI dockers made available by WP4 NCSR-D SWC
TF FhG
A Kafka producer for FCD
data stream (source URL)
and historical data (source
file system)
Develop a Kafka producer to collect
the FCD data as a stream from web
services and from the file system for
the historical data sets and send them
to a Kafka topic
FhG
Kafka brokers Install Kafka to provide a message
broker and the topics
SWC
A Spark application for traffic
forecasting and model
training
Develop a Spark application that
consumes FCD matched data from a
Kafka topic The application will train a
prediction model and write the traffic
predictions to ElasticSearch
FhG
A Kafka consumer for storing
analysis results
Develop a Kafka consumer that stores
the result of the Traffic Classification
and prediction module
FhG
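For the Kafka producer listed in Table 8, a minimal sketch follows, assuming the kafka-python and requests packages; the web-service URL, topic name and polling interval are illustrative.

```python
# Minimal sketch of the FCD Kafka producer: poll a web service for
# FCD records and publish them to a Kafka topic. Names are hypothetical.
import json
import time
import requests
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
while True:
    records = requests.get("http://fcd-service.example.org/latest").json()
    for record in records:            # one message per cab position
        producer.send("fcd-raw", record)
    producer.flush()
    time.sleep(60)                    # near-real-time polling interval
```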
6 Second SC5 Pilot Deployment
6.1 Use cases
The pilot is carried out by NCSR-D in the frame of SC5: Climate Action, Environment, Resource
Efficiency and Raw Materials.
The pilot demonstrates the following workflow. A (potentially hazardous) substance is released
into the atmosphere, which results in increased readings at one or more monitoring stations.
The user accesses a user interface provided by the pilot to define the locations of the monitoring
stations as well as a timeseries of the measured values (e.g. gamma dose rate). The platform
initiates:
- a weather matching algorithm, that is, a search for similarity between the current
weather and the pre-computed weather patterns, as well as
- a dispersion matching algorithm, that is, a search for similarity between the current
substance dispersion and the precomputed dispersion patterns.
The weather patterns have been extracted in a pre-processing step by clustering weather
conditions recorded in the past, while the substance dispersion patterns have been
precomputed by simulating different scenarios of substance release and weather conditions.
The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request.
The following datasets are involved:
- NetCDF files from the European Centre for Medium-range Weather Forecasting
(ECMWF7).
- GRIB files from the National Oceanic and Atmospheric Administration (NOAA8).
The following processing will be carried out:
- The weather clustering algorithm that creates clusters of similar weather conditions,
implemented using the BDI platform (see Section 6.3).
7 http://apps.ecmwf.int/datasets 8 https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/global-forcast-system-gfs
- The WRF downscaling, which takes as input a low-resolution weather field and creates
a high-resolution one.
- The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model, which
computes dispersion patterns given predominant weather conditions.
The following outputs are made available for visualization or further processing:
- The dispersions produced by DIPCOT.
- The weather clusters produced by the weather clustering algorithm.
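To make the clustering step concrete, the following is a minimal sketch using scikit-learn, assuming the historical weather fields have been read from NetCDF into a samples-by-features matrix; the file name, variable name and number of clusters are illustrative.

```python
# Minimal sketch of the weather clustering step: flatten gridded
# weather variables into feature vectors and cluster them.
# File, variable and parameter choices are hypothetical.
import numpy as np
from netCDF4 import Dataset
from sklearn.cluster import KMeans

nc = Dataset("weather_history.nc")        # hypothetical weather archive
wind_u = nc.variables["u10"][:]           # shape: (time, lat, lon)
features = wind_u.reshape(wind_u.shape[0], -1)

kmeans = KMeans(n_clusters=8, random_state=0).fit(features)
labels = kmeans.labels_                   # one weather-pattern id per timestamp
centroids = kmeans.cluster_centers_       # the pre-computed weather patterns
np.save("weather_patterns.npy", centroids)
```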
6.2 Requirements
Table 9 lists the ingestion, storage, processing and output requirements set by this pilot.
Table 9 Requirements of Second SC5 Pilot
Requirement Comment
R1 Provide a means of downloading
current/evaluation weather data from
ECMWF or alternative services.
A data connector/interface needs to be developed.
R2 ECMWF and NOAA datasets are
compatible with the WRF and
DIPCOT naming conventions.
A preprocessing WPS normalization step will
perform the necessary transformations and
variable renamings needed to ensure compatibility.
R3 Retrieve NetCDF files from HDFS
as input to the weather clustering
algorithm
R4 Dispersion matching will filter on
dispersion values.
A relational database will provide indexes on
dispersion values for efficient dispersion search.
R5 Dispersion visualization.
Weather and dispersion matching must produce
output compatible with Sextant's input, or Sextant
must be modified to support the new input.
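The indexed search behind R4 can be illustrated with a short sketch using psycopg2; the table schema, column names and thresholds are hypothetical.

```python
# Minimal sketch of the indexed dispersion search (R4). The
# "dispersions" table and its columns are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=sc5 user=pilot host=postgres")
cur = conn.cursor()
# An index on the dispersion value column keeps range filters cheap.
cur.execute("CREATE INDEX IF NOT EXISTS idx_disp_value ON dispersions (value)")
cur.execute(
    "SELECT scenario_id, station_id, value FROM dispersions "
    "WHERE station_id = %s AND value BETWEEN %s AND %s",
    ("station-42", 0.8, 1.2),
)
for scenario_id, station_id, value in cur.fetchall():
    print(scenario_id, station_id, value)
conn.commit()
```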
Figure 5: Architecture of the Second SC5 Pilot
6.3 Architecture
To satisfy the requirements described above, the following components will be deployed.
Storage infrastructure:
- HDFS for storing NetCDF and GRIB files.
- Postgres for storing dispersions.
Processing components:
- Scikit-learn or TensorFlow to host the weather clustering algorithm.
Other modules:
- ECMWF and NOAA data connectors.
- WPS normalization procedure.
- WRF downscaling component.
- DIPCOT atmospheric dispersion model.
- Weather and dispersion matching.
- Sextant for visualizing the dispersion layer.
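Weather matching then reduces to finding the pre-computed pattern closest to the current weather field. The following is a minimal sketch, reusing the centroids from the clustering sketch above; array shapes and file names are illustrative.

```python
# Minimal sketch of weather matching: pick the pre-computed pattern
# (cluster centroid) nearest to the current weather field.
import numpy as np

patterns = np.load("weather_patterns.npy")        # centroids from clustering
current = np.load("current_weather.npy").ravel()  # flattened current field

distances = np.linalg.norm(patterns - current, axis=1)
best = int(np.argmin(distances))
print("Closest pre-computed weather pattern:", best)
```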
6.4 Deployment
Table 10 lists the components provided to the pilot as part of BDI, and components that will be
developed within WP6 in the context of executing the pilot.
Table 10 Components needed to deploy the Second SC5 Pilot
Module Task Responsible
HDFS Sextant Postgres BDI dockers made available by WP4 TF UoA NCSR-D
Scikit-learn TensorFlow To be developed in the pilot NCSR-D
DIPCOT To be packaged in the pilot NCSR-D
Weather clustering algorithm To be developed in the pilot NCSR-D
Weather matching To be developed in the pilot NCSR-D
Dispersion matching To be developed in the pilot NCSR-D
ECMWF and NOAA data
connector
To be developed in the pilot NCSR-D
Data visualization UI To be developed in the pilot NCSR-D
7 Second SC6 Pilot Deployment
7.1 Use cases
The pilot is carried out by NCSR-D and SWC in the frame of SC6: Europe in a changing world
- inclusive, innovative and reflective societies.
The pilot demonstrates the following workflow. Municipality economic data (i.e. budget and
budget execution data) are ingested on a regular basis (daily, weekly, and so on) from a series
of locations, in a variety of structures and formats, and are homogenized so that they can be
compared, analyzed and visualized in a comprehensible way. The data is exposed to users
via a dashboard that exposes search/discovery, aggregation, analysis, correlation and
visualization functionalities over structured data. The results of the data analysis will be stored
in the infrastructure to avoid carrying out the same processing multiple times.
The second cycle of the pilot will extend the first pilot by incorporating different formats, by
developing a modular parsing library.
The following datasets are involved:
- Budget execution data of the Municipality of Athens.
- Budget execution data of the Municipality of Thessaloniki.
- Budget execution data of the Municipality of Barcelona.
The current datasets involved are exposed either as an API or as CSV/XML files.
Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies.
Statistical data will be described in the RDF Data Cube12 vocabulary.
The following processing is carried out:
- Data ingestion and homogenization.
- Aggregation, analysis and correlation over the budget data.
The following outputs are made available for visualization or further processing:
- Structured information extracted from budget datasets, exposed as a SPARQL endpoint.
- Metadata for dataset searching and discovery.
9 Cf. https://joinup.ec.europa.eu/asset/dcat_application_profile/description 10 Cf. http://www.omg.org/spec/EDMC-FIBO/FND/1.0/Beta1/index.htm 11 Cf. http://www.omg.org/hot-topics/finance.htm 12 Cf. https://www.w3.org/TR/2014/REC-vocab-data-cube-20140116/
- Aggregation and analysis results.
7.2 Requirements
Table 11 lists the ingestion, storage, processing and output requirements set by this pilot.
Table 11 Requirements of the Second SC6 Pilot
Requirement Comment
R1 In case of failures during data
processing, users should be able to
recover data produced earlier in the
workflow, along with its lineage and
associated metadata.
Processing modules should periodically
store intermediate results. When starting
up, processing modules should check at the
metadata registry whether intermediate results
are available.
R2 Transform budget data into a
homogenized format using various
parsers
Parsers will be developed for the pilot
taking into account R1
R3 Expose data and metadata through a
SPARQL endpoint
The triple store should be accessed via a
SPARQL endpoint
R4 Intuitive easy-to-use interface for
searching and selecting relevant data
sources The use of the user interface
should be documented so that users
can ease into using it with as little
effort as possible
The GraphSearch UI will be used to create
visualizations from SPARQL queries
Figure 6: Architecture of the Second SC6 Pilot
7.3 Architecture
To satisfy the requirements above, the following modules will be deployed.
Storage infrastructures:
- HDFS for storing ingested datasets.
- 4store for storing homogenized statistical data and dataset metadata.
Processing infrastructures:
- Metadata extraction: Spark is used to extract RDF data and metadata from budget
data. These tools will react to Kafka messages.
- PoolParty: a SKOS Thesaurus13 will be used to consolidate/translate (link/map) the
terms in the ingested documents (e.g. bio terms, locations and other named entities).
For this step the SWC PoolParty Semantic Suite14 will be used as an external service.
13 Please cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System 14 Please cf. http://www.poolparty.biz
PoolParty is accessible from the BDE components via an HTTP API. The connection
between Spark and PoolParty has been implemented in the first pilot cycle. Additional
enrichment of the dataset will be explored, e.g. via linking to DBpedia or other LOD
sources.
- Data analysis, which will be performed on demand by pre-defined queries in the
dashboard.
Other modules:
- Flume for dataset ingestion: for every source that will be ingested into the system there
will be a Flume agent responsible for data ingestion and basic modification/unification.
- Kafka: as soon as a new record is available, a Kafka message will be produced. One
Kafka consumer stores raw data into HDFS.
- A set of pre-defined SPARQL queries that carry out analytical aggregations, important
comparisons and/or other analysis of the data.
- A GUI that provides functionality for (a) metadata searching, to discover datasets, data
and publications, and (b) linked data browsing (i.e. dereferencing entity descriptions in
RDF) in the form of a visual dashboard realised in d3.js15.
- GraphSearch as the user interface.
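One of the pre-defined analytical SPARQL queries might look as follows. This is a minimal sketch using the SPARQLWrapper package; the endpoint URL and the budget property are illustrative, with only the RDF Data Cube terms taken from the vocabulary named above.

```python
# Minimal sketch of a pre-defined aggregation query against the
# pilot's SPARQL endpoint. Endpoint URL and the budget#amount
# property are hypothetical.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://4store:8080/sparql/")
sparql.setQuery("""
    PREFIX qb: <http://purl.org/linked-data/cube#>
    SELECT ?dataset (SUM(?amount) AS ?total)
    WHERE {
        ?obs a qb:Observation ;
             qb:dataSet ?dataset ;
             <http://example.org/budget#amount> ?amount .
    }
    GROUP BY ?dataset
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["dataset"]["value"], row["total"]["value"])
```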
7.4 Deployment
Table 12 lists the components provided to the pilot as part of BDI, and components that will be
developed within WP6 in the context of executing the pilot.
Table 12 Components needed to deploy the Second SC6 Pilot
Module Task Responsible
Spark over HDFS 4store
Flume Kafka
BDI dockers made available by WP4 FH TF InfAI
NCSR-D SWC
Data storage schema To be extended for the pilot SWC
Metadata extraction Parsers for different data sources will
be developed for the pilot
SWC
15 Cf. https://d3js.org
GraphSearch GUI To be configured for the pilot SWC
8 Second SC7 Pilot Deployment
8.1 Use cases
The pilot is carried out by SatCen, UoA and NCSR-D in the frame of SC7: Secure societies –
Protecting freedom and security of Europe and its citizens.
The pilot demonstrates the following workflows:
1. Event detection workflow: news sites and social media are monitored and processed
in order to extract and localize information about events. Events are categorized and
the information in them is extracted; the end-user is notified about the area concerned
by the news and can visualize the event information together with the changes
detected by the other workflow (if activated).
2. Change detection workflow: the end user selects a relevant Area of Interest. With
respect to the selected dates, two satellite images (earliest and latest) of these areas
are downloaded from the ESA Sentinels Scientific Data Hub and processed in order to
detect changes. The end-user is notified about detected changes and can view the
images and event information about this area.
The second cycle of the SC7 pilot will extend the functionality and improve the performance of
the first cycle of the pilot (cf. D5.2, Section 8).
Apart from the datasets used in the first cycle of the pilot, this cycle will also use the keyword-
based Twitter API to retrieve tweets based on pre-defined keywords. To further support the
keyword-based search, the second cycle of the pilot will also include a full-text indexing engine.
The following outputs are made available for visualization or further processing:
- Relevant news related to specific keywords, together with the corresponding Area of
Interest.
- Detected changes.
Moreover, the event detection workflow will be extended in order to automatically activate the
change detection workflow. These changes are depicted in the updated architecture diagram
in Figure 7.
8.2 Requirements
Table 13 lists the ingestion, storage, processing and output requirements set by the second
cycle of the pilot. Since the present pilot cycle is an extension of the first pilot, the requirements
of the first pilot also apply; Table 13 lists only the new requirements.
Table 13 Requirements of the Second SC7 Pilot
Requirement Comment
R1 Monitor keyword-based text services
(Twitter) Text is retrieved and stored
together with provenance and any
metadata provided by the service
(notably location)
The NOMAD data connectors to Twitter
and Reuters will be adapted to access the
keyword search API of Twitter and store to
Cassandra
R2 Regularly execute event detection
using Spark over the most recent text
batch.
Event detection is part of the ingestion
process and adds annotations to the text
data; it is not part of the distributed processing.
R3 Improve the speed of the change
detection workflow
Optimize the scalability of the operators
developed in Apache Spark for the change
detection workflow
R4 Extend change detection workflow to
improve accuracy
Fundamental SNAP operators (eg Subset
and Terrain Correction) for Sentinel 1 will be
adapted to Apache Spark
R5 Areas of Interest are automatically
defined by event detection
The Sentinel data connector is
parametrized from the event detection
module with a GIS shape
R6 End-user interface is based on Sextant Improvement of Sextant functionalities to
improve the user experience
R7 Users must be authenticated and
authorized to access the pilot data
Sextant will be extended in order to support
authentication and authorization
Figure 7: Architecture of the Second SC7 Pilot
8.3 Architecture
To satisfy the requirements above, the following modules will be deployed.
Storage infrastructures:
- HDFS for storing satellite images.
- Cassandra for storing news and tweet content and metadata.
- Lucene for storing the GADM dataset, i.e. the administrative areas together with their
geo-locations.
- Strabon for storing geo-locations of detected changes and location metadata about
news and tweets.
Processing infrastructures:
- Spark will be made available for improving the change detection module and
developing the event detection module.
Data integration:
- Semagrow will federate Strabon and Cassandra to provide the user interface with
homogeneous access to both data stores.
Other modules:
- Twitter data connector.
- Reuters RSS feed reader.
- The Sentinel Data Aggregator, which receives as input the set of areas of interest and
submits a suitable query to the Sentinels Scientific Data Hub.
- Sextant as the user interface.
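As an illustration of the Twitter data connector extended for keyword search (R1), the following is a minimal sketch, assuming the tweepy (pre-4.x) and cassandra-driver packages; the credentials, keyspace, table and keyword list are hypothetical.

```python
# Minimal sketch of the keyword-based Twitter connector: poll the
# Twitter search API and store tweets with basic provenance metadata
# into Cassandra. All names and credentials are hypothetical.
import tweepy
from cassandra.cluster import Cluster

auth = tweepy.AppAuthHandler("API_KEY", "API_SECRET")
api = tweepy.API(auth)
session = Cluster(["cassandra"]).connect("sc7")

insert = session.prepare(
    "INSERT INTO tweets (id, keyword, text, created_at, geo) VALUES (?, ?, ?, ?, ?)"
)
for keyword in ["flood", "wildfire"]:          # pre-defined keywords
    for tweet in api.search(q=keyword, count=100):
        session.execute(insert, (tweet.id, keyword, tweet.text,
                                 tweet.created_at, str(tweet.geo)))
```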
8.4 Deployment
Table 14 lists the components provided to the pilot as part of BDI16, and components that will
be developed within WP6 in the context of executing the pilot.
Table 14 Components needed to deploy the Second SC7 Pilot
Module Task Responsible
Big Data Integrator
HDFS/Hadoop Cassandra
Spark Semagrow Strabon
SOLR
BDI dockers made available by WP4 FH TF InfAI
NCSR-D UoA
SWC
Cassandra and Strabon
stores
The schema needs to be altered to
support tweets by keyword
NCSR-D and
UoA
Change detection module Spark code to be developed for
extending and improving the change
detection algorithm
UoA
16 Cf. https://github.com/big-data-europe/README/wiki/Components
Event Detection module Spark code to be developed to scale
the event detection algorithm
NCSR-D
Twitter data connector To be extended to access the keyword
search Twitter API
NCSR-D
User interface To be enhanced for the pilot UoA
9 Conclusions
This report analysed the pilot requirements and specified the components of the generic
Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting
round. The relevant work in this task is to ensure that the components are within the scope
of what is prepared in WP4, and that they interoperate and can be used in the same
application.
All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure
and provided to the piloting partners as a basis for their piloting applications, which will be
developed in WP6. As a result of this preliminary testing and of the interaction between the
technical partners and the piloting partners, some of the original pilot descriptions have
been refined and fully specified, and their usage of BDI components has been clarified. This
ensures that the pilot descriptions are consistent with the first public release of the BDI
platform (D4.2) and can be reproduced by interested third parties.
Work in this task (Task 5.2) will proceed as follows:
- During the second pilot deployment phase, work in this task will follow and document
the development of the individual components and test their integration into the platform.
- During the third pilot deployment phase, work in this task will prepare the next version
of this document, regarding the BDI instances needed for the third piloting round.
D54 ndash v 100
Page
30
Figure 5 Architecture of the Second SC5 Pilot
Figure 5 Architecture of the Second SC5 Pilot
63 Architecture
To satisfy the requirements described above the following components will be deployed
Storage infrastructure
HDFS for storing NetCDF and GRIB files
Postgres for storing dispersions
Processing components
Scilearn-kit or TensorFlow to host the weather clustering algorithm
Other modules
ECMWF and NOAA data connectors
WPS normalization procedure
WRF downscaling component
DIPCOT atmospheric dispersion model
Weather and dispersion matching
Sextant for visualizing the dispersion layer
D54 ndash v 100
Page
31
64 Deployment
Table 10 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 10 Components needed to deploy the Second SC5 Pilot
Module Task Responsible
HDFS Sextant Postgres BDI dockers made available by WP4 TF UoA NCSR-D
Scikit-learn TensorFlow To be developed in the pilot NCSR-D
DIPCOT To be packaged in the pilot NCSR-D
Weather clustering algorithm To be developed in the pilot NCSR-D
Weather matching To be developed in the pilot NCSR-D
Dispersion matching To be developed in the pilot NCSR-D
ECMWF and NOAA data
connector
To be developed in the pilot NCSR-D
Data visualization UI To be developed in the pilot NCSR-D
Table 10 Components needed to deploy the Second SC5 Pilot
D54 ndash v 100
Page
32
7 Second SC6 Pilot Deployment
71 Use cases
The pilot is carried out by NCSR-D and SWC in the frame of SC6 Europe in a changing world
- inclusive innovative and reflective societies
The pilot demonstrates the following workflow Municipality economic data (ie budget and
budget execution data) are ingested at a regular basis (daily weekly and so on) from a series
of locations in a variety of structures and formats are homogenized so that they can be
compared analyzed and visualized in a comprehensible way The data is exposed to users
via a dashboard that exposes searchdiscovery aggregation analysis correlation and
visualization functionalities over structured data The results of the data analysis will be stored
in the infrastructure to avoid carrying out the same processing multiple times
The second cycle of the pilot will extend the first pilot by incorporating different formats by
developing a modular parsing library
The following datasets are involved
Budget execution data of Municipality of Athens
Budget execution data of Municipality of Thessaloniki
Budget execution data of Municipality of Barcelona
The current datasets involved are exposed either as an API or as CSV XML files
Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies
Statistical data will be described in the RDF DataCube12 vocabulary
The following processing is carried out
Data ingestion and homogenization
Aggregation analysis correlation over scientific data
The following outputs are made available for visualization or further processing
Structured information extracted from budget datasets exposed as a SPARQL endpoint
Metadata for dataset searching and discovery
9 Cf httpsjoinupeceuropaeuassetdcat_application_profiledescription 10 Cf httpwwwomgorgspecEDMC-FIBOFND10Beta1indexhtm 11 Cf httpwwwomgorghot-topicsfinancehtm 12 Cf httpswwww3orgTR2014REC-vocab-data-cube-20140116
D54 ndash v 100
Page
33
Aggregation and analysis
72 Requirements
Table 11 lists the ingestion storage processing and output requirements set by this pilot
Table 11 Requirements of the Second SC6 Pilot
Requirement Comment
R1 In case of failures during data
processing users should be able to
recover data produced earlier in the
workflow along with their lineage and
associated metadata
Processing modules should periodically
store intermediate results When starting
up processing modules should check at the
metadata registry if intermediate results are
available
R2 Transform budget data into a
homogenized format using various
parsers
Parsers will be developed for the pilot
taking into account R1
R3 Expose data and metadata through a
SPARQL endpoint
The triple store should be accessed via a
SPARQL endpoint
R4 Intuitive easy-to-use interface for
searching and selecting relevant data
sources The use of the user interface
should be documented so that users
can ease into using it with as little
effort as possible
The GraphSearch UI will be used to create
visualizations from SPARQL queries
Table 11 Requirements of the Second SC6 Pilot
D54 ndash v 100
Page
34
Figure 6 Architecture of the Second SC6 Pilot
Figure 6 Architecture of the Second SC6 Pilot
73 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS for storing ingested datasets
4store for storing homogenized statistical data and dataset metadata
Processing infrastructures
Metadata extraction Spark is used to extract RDF data and metadata from budget
data These tools will react on Kafka messages
PoolParty A SKOS Thesaurus13 will be used to consolidatetranslate (linkmap) the
terms in the ingested documents (eg bio terms locations and other named entities)
For this step the SWC PoolParty Semantic Suite14 will be used as an external service
13 Please cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 14 Please cf httpwwwpoolpartybiz
D54 ndash v 100
Page
35
PoolParty is accessible from the BDE components via an HTTP API The connection
between Spark and PoolParty has been implemented in the first pilot cycle Additional
enrichment of the dataset will be explored eg via linking to DBpedia or other LOD
sources
Data analysis that will be performed on demand by pre-defined queries in the
dashboard
Other modules
Flume for dataset ingestion For every source that will be ingested into the system there
will be a flume agent responsible for data ingestion and basic modificationunification
Kafka as soon as a new record is available a Kafka message will be produced One
kafka consumer stores raw data into HDFS
A set of pre-defined SPARQL queries that carry out analytical aggregations important
comparisons and or other analysis of the data
GUI that provide functionality for (a) metadata searching to discover datasets data and
publications (b) linked data browsing (ie dereferencing entity descriptions in RDF) in
the form of a visual dashboard realised in d3js15
GraphSearch as the user interface
74 Deployment
Table 12 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 12 Components needed to deploy the Second SC6 Pilot
Module Task Responsible
Spark over HDFS 4store
Flume Kafka
BDI dockers made available by WP4 FH TF InfAI
NCSR-D SWC
Data storage schema To be extended for the pilot SWC
Metadata extraction Parsers for different data sources will
be developed for the pilot
SWC
15 Cf httpsd3jsorg
D54 ndash v 100
Page
36
GraphSearch GUI To be configured for the pilot SWC
Table 12 Components needed to deploy the Second SC6 Pilot
8 Second SC7 Pilot Deployment
81 Use cases
The pilot is carried out by SatCen UoA and NCSR-D in the frame of SC7 Secure societies ndash
Protecting freedom and security of Europe and its citizens
The pilot demonstrates the following workflows
1 Event detection workflow News sites and social media are monitored and processed
in order to extract and localize information about events Events are categorized and
the information from them is extracted the end-user is notified about the area interested
by the news and can visualize the events information together with the changes
detected by the other workflow (if activated)
2 Change detection workflow The end user selects a relevant Area of Interest With
respect to the selected dates two satellite images (earliest and latest) of these areas
are downloaded from ESA Sentinels Scientific Data Hub and processed in order to
detect changes The end-user is notified about detected changes and can view the
images and event information about this area
The second cycle of the SC7 pilot will extend the functionality and improve the performance of
the first cycle of the pilot (cf D52 Section 8)
Apart from the datasets used in the first cycle of the pilot this cycle will also use the keyword-
based Twitter API to retrieve tweets based on pre-defined keywords To further support the
keyword-based search the second cycle of the pilot will also include a full-text indexing engine
The following outputs are made available for visualization or further processing
Relevant news related to specific keywords together with the corresponding Area of
Interested
Detected changes
Moreover the event detection workflow will be extended in order to automatically activate the
change detection workflow These changes are depicted in the updated architecture diagram
in Figure 7
D54 ndash v 100
Page
37
82 Requirements
Table 13 lists the ingestion storage processing and output requirements set by the second
cycle of the pilot Since the present pilot cycle is an extension of the first pilot the requirements
of the first pilot also apply Table 13 lists only the new requirements
Table 13 Requirements of the Second SC7 Pilot
Requirement Comment
R1 Monitor keyword-based text services
(Twitter) Text is retrieved and stored
together with provenance and any
metadata provided by the service
(notably location)
The NOMAD data connectors to Twitter
and Reuters will be adapted to access the
keyword search API of Twitter and store to
Cassandra
R2 Regularly execute event detection
using Spark over the most recent text
batch
Event detection is part of the ingestion
process and adds annotations to the text
data not part of the distributed processing
R3 Improve the speed of the change
detection workflow
Optimize the scalability of the operators
developed in Apache Spark for the change
detection workflow
R4 Extend change detection workflow to
improve accuracy
Fundamental SNAP operators (eg Subset
and Terrain Correction) for Sentinel 1 will be
adapted to Apache Spark
R5 Areas of Interest are automatically
defined by event detection
The Sentinel data connector is
parametrized from the event detection
module with a GIS shape
R6 End-user interface is based on Sextant Improvement of Sextant functionalities to
improve the user experience
D54 ndash v 100
Page
38
R7 Users must be authenticated and
authorized to access the pilot data
Sextant will be extended in order to support
authentication and authorization
Table 13 Requirements of the Second SC7 Pilot
Figure 7 Architecture of the Second SC7 Pilot
Figure 7 Architecture of the Second SC7 Pilot
83 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS for storing satellite images
Cassandra for storing news and tweets content and metadata
Lucene for storing GADM dataset ie the administrative areas together with their geo-
locations
D54 ndash v 100
Page
39
Strabon for storing geo-locations of detected changes and location metadata about
news and tweets
Processing infrastructures
Spark will be made available for improving the change detection module and
developing the event detection module
Data integration
Semagrow will federate Strabon and Cassandra to provide the user interface with
homogeneous access to both data stores
Other modules
Twitter data connector
Reuters RSS feed reader
The Sentinel Data Aggregator receives as input the set of areas of interest and submits
a suitable query to the Sentinels Scientific Data Hub
Sextant as the user interface
84 Deployment
Table 14 lists the components provided to the pilot as part of BDI16 and components that will
be developed within WP6 in the context of executing the pilot
Table 14 Components needed to deploy the Second SC7 Pilot
Module Task Responsible
Big Data Integrator
HDFSHadoop Cassandra
Spark Semagrow Strabon
SOLR
BDI dockers made available by WP4 FH TF InfAI
NCSR-D UoA
SwC
Cassandra and Strabon
stores
The schema needs to be altered to
support tweets by keyword
NCSR-D and
UoA
Change detection module Spark code to be developed for UoA
16 Cf httpsgithubcombig-data-europeREADMEwikiComponents
D54 ndash v 100
Page
40
extending and improving the change
detection algorithm
Event Detection module Spark code to be developed to scale
the event detection algorithm
NCSR-D
Twitter data connector To be extended to access the keyword
search Twitter API
NCSR-D
User interface To be enhanced for the pilot UoA
Table 14 Components needed to deploy the Second SC7 Pilot
D54 ndash v 100
Page
41
9 Conclusions This report analysed the pilot requirements and specifies the components of the the generic
Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting
round The relevant work in this task is to ensure that the components are within the scope
of what is prepared in WP4 and that they interoperate and can be used in the same
application
All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure
and provided to the piloting partners as a basis for their piloting applications which will be
developed in WP6 As a result of this preliminary testing and the interaction between the
technical partners and the piloting partners some of the original pilot descriptions have
been refined and fully specified and their usage of BDI components has been clarified This
ensures that the pilot descriptions are consistent with the first public release of the BDI
platform (D42) and can be reproduced by interested third parties
Work in this task (Task 52) will proceed as follows
During the second pilot deployment phase work in this task will follow and document
development of the individual components and test their integration into the platform
During the third pilot deployment phase work in this task will prepare the next version
of this document regarding the BDI instances needed for the third piloting round
D54 ndash v 100
Page
10
2 Second SC1 Pilot Deployment
2.1 Use Cases
The pilot is carried out by OPF and VU in the frame of SC1: Health, Demographic Change and Wellbeing.
The pilot demonstrates the workflow of reproducing the functionality of an existing data integration and processing system (the Open PHACTS Discovery Platform) on BDI. The second pilot extends the first pilot (cf. D5.2, Section 2) with the following:
- Discussions with stakeholders and other Societal Challenges will identify how the existing Open PHACTS platform and datasets may potentially be used to answer queries in other domains. In particular, applications in Societal Challenge 2 (food security and sustainable agriculture), where the effects of chemistry (e.g. pesticides) on biology are probed in plants, could exploit the linked data services currently within the OPF platform. This will require discussing use case specifics with SC2 to understand their requirements and ensure that the OPF data is applicable. Similarly, we will explore whether SC2 data that is relevant for early biology research could be linked to the OPF data platform.
- No specific new datasets are targeted for integration in the second pilot. However, if datasets to be made available through other pilots have clear potential links to Open PHACTS datasets, these will be considered for integration into the platform, offering researchers the ability to pose more complex queries across a wider range of data.
- The second pilot will aim to expand on the first pilot by refreshing the datasets integrated into the pilot. Homogenising and integrating the new data available for these datasets, and developing ways to update datasets by integrating new data on an ongoing basis, will enable new use cases where researchers require fully current datasets for their queries.
- The second pilot will also simplify existing workflows for querying the API, for example with components for common software tools such as KNIME, reducing the barrier for academic institutions and companies to access the platform for knowledge- and data-driven biomedical research use cases.
2.2 Requirements
Table 1 lists the ingestion, storage, processing and output requirements set by this pilot.
Table 1: Requirements of the Second SC1 Pilot
Requirement | Comment
R1: The solution should be packaged in a way such that it is possible to combine the Open PHACTS Docker and the BDE platform to achieve a custom integrated solution. | Specificities of the services of the Open PHACTS Discovery Platform should not be hard-wired into the domain-specific instance, but should be read from a configuration file (such as Swagger). The BDE instance should offer or apply these external services over data hosted by the BDE instance.
R2: RDF data storage. | The current Open PHACTS Discovery Platform is based on distributed Virtuoso, a proprietary solution. The BDE platform will provide a distributed 4store and SANSA, to be compared with the Open PHACTS Discovery Platform.
R3: Datasets are aligned and linked at data ingestion time, and the transformed data is stored. | In conjunction with R1, a modular data ingestion component should dynamically decide which data transformers to invoke.
R4: Data and query security and privacy requirements. | A BDI local deployment holds private data and serves private queries. BDE does not foresee any specific technical support for query obfuscation, so remote data sources need to be cloned locally to guarantee query privacy.
Figure 1: Architecture of the Second SC1 Pilot
2.3 Architecture
To satisfy the requirements above, the following modules will be deployed.
Storage infrastructures:
- A distributed triple store for the data. The second pilot cycle will also test the feasibility of using the SANSA stack1 as an alternative for SPARQL query processing.
Processing infrastructures:
- Scientific Lenses query expansion.
Other modules:
- A data connector, including the data transformation modules for the alignment of data at ingestion time.
- A REST API for querying that builds a SPARQL query by using keywords to fill in pre-defined query templates (a minimal sketch follows the list). The querying service also uses Scientific Lenses to expand queries.
1 http://sansa-stack.net
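To make the template-based query endpoint concrete, the following is a minimal sketch of the template-filling step, assuming the SPARQLWrapper client library; the endpoint URL, the template text and the parameter name are illustrative assumptions, not the pilot's actual API.

    # Minimal sketch: a keyword is substituted into a pre-defined SPARQL
    # template and the filled query is sent to the triple store. Endpoint
    # URL, template and parameter name are assumptions for illustration.
    from SPARQLWrapper import SPARQLWrapper, JSON

    LABEL_SEARCH = """
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    SELECT ?entity ?label WHERE {
      ?entity skos:prefLabel ?label .
      FILTER (CONTAINS(LCASE(STR(?label)), LCASE("%(keyword)s")))
    } LIMIT 100
    """

    def run_template(endpoint, template, **params):
        """Fill a query template with keyword parameters and execute it."""
        client = SPARQLWrapper(endpoint)
        client.setQuery(template % params)
        client.setReturnFormat(JSON)
        return client.query().convert()["results"]["bindings"]

    for row in run_template("http://localhost:8890/sparql",
                            LABEL_SEARCH, keyword="aspirin"):
        print(row["entity"]["value"], row["label"]["value"])

Keeping the templates external to the code, as R1 requires for service descriptions, means new query types can be added as configuration rather than as code changes.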
2.4 Deployment
Table 2 lists the components provided to the pilot as part of BDI, and components that will be developed within WP6 in the context of executing the pilot.
Table 2: Components needed to deploy the Second SC1 Pilot
Module | Task | Responsible
4store | BDI dockers made available by WP4 | NCSR-D
SANSA stack | BDI dockers made available by WP4 | FhG/UniBonn
Data connector and transformation modules | Develop a dynamic transformation engine that uses Swagger descriptions to select the appropriate transformer | VU
Query endpoint | Develop a dynamic query re-write engine that uses Swagger descriptions to select the transformer | VU
Scientific Lenses query expansion module | Needs to be deployed and tested, unless an existing live service will be used for the BDE pilot | VU
3 Second SC2 Pilot Deployment
3.1 Overview
The pilot is carried out by AK, FAO and SWC in the frame of SC2: Food Security, Sustainable Agriculture and Forestry, Marine, Maritime and Inland Water Research, and the Bioeconomy.
The second pilot cycle builds upon the first pilot cycle (cf. D5.1, Section 3), expanding the relevant data sources and extending the data processing needed to handle a variety of data types (apart from bibliographic data) relevant to viticulture.
The pilot demonstrates the following workflows:
1. Text mining workflow: automatically annotating scientific publications by (a) extracting named entities (locations, domain terms) and (b) extracting the captions of images, figures and tables. The extracted information is provided to viticultural researchers via a GUI that exposes search functionality.
2. Data processing workflow: the end users (viticultural researchers) upload scientific data in a variety of formats and provide the metadata needed in order to correctly interpret the data. The data is ingested and homogenized, so that it can be compared and connected with other relevant data originally in diverse formats. The data is exposed to viticultural researchers via a GUI that exposes search/discovery, aggregation, analysis, correlation and visualization functionalities over structured data. The results of the data analysis will be stored in the infrastructure, to avoid carrying out the same processing multiple times, with appropriate provenance for future reference, publication and scientific replication.
3. Phenologic modeling workflow, that is, the scheduling of agricultural operations (e.g. pruning, harvesting) by cross-examining the weather data observed in the area of the vineyard with the appropriate weather conditions needed for the aforementioned operations (a minimal sketch of this idea follows the list).
4. Variety identification workflow: the end users complete an on-spot questionnaire regarding the characteristics of a specific grape variety. Together with the geolocation of the questionnaire, this information is used to identify a grape variety.
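As a concrete illustration of workflow 3, the following is a minimal sketch of phenologic scheduling based on a simple growing-degree-day (GDD) accumulation model. The base temperature and the operation thresholds are illustrative assumptions and do not reflect the actual AK VITIS model.

    # Minimal sketch of phenologic scheduling: accumulate growing degree
    # days (GDD) from observed weather and flag operations whose
    # (hypothetical) thresholds have been reached.
    from dataclasses import dataclass

    BASE_TEMP_C = 10.0  # assumed base temperature for grapevine GDD

    @dataclass
    class DailyWeather:
        t_min: float  # daily minimum temperature, degrees Celsius
        t_max: float  # daily maximum temperature, degrees Celsius

    def accumulated_gdd(days):
        """Sum of daily mean temperatures above the base temperature."""
        return sum(max((d.t_min + d.t_max) / 2.0 - BASE_TEMP_C, 0.0)
                   for d in days)

    # Hypothetical GDD thresholds at which operations become due.
    OPERATION_THRESHOLDS = {"shoot thinning": 250.0, "harvest": 1300.0}

    def due_operations(days):
        gdd = accumulated_gdd(days)
        return [op for op, t in OPERATION_THRESHOLDS.items() if gdd >= t]

    print(due_operations([DailyWeather(12.0, 26.0)] * 30))  # example run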
The following datasets will be involved:
- The AGRIS and PubMed datasets, which include scientific publications.
- Weather data available via publicly-available APIs, such as AccuWeather, OpenWeatherMap and Weather Underground.
- User-generated data, such as geotagged photos of leaves, young shoots and grape clusters, ampelographic data, and SSR-marker data that will be provided by the VITIS application.
- The OIV Descriptor List2 for Grape Varieties and Vitis species.
- The Crop Ontology.
2 http://www.oiv.int/en
The following processing is carried out:
- Named entity extraction
- Researcher affiliation extraction and verification
- Variety identification
- Phenologic modelling
- PDF structure processing to associate tables and diagrams with captions
- Data ingestion and homogenization
- Aggregation, analysis and correlation over scientific data
The following outputs are made available for visualization or further processing:
- Structured information and topics extracted from scientific publications
- Metadata for dataset searching and discovery
- Aggregation, analysis and correlation results
3.2 Requirements
Table 3 lists the ingestion, storage, processing and output requirements set by this pilot.
Table 3: Requirements of the Second SC2 Pilot
Requirement | Comment
R1: In case of failures during data processing, users should be able to recover data produced earlier in the workflow, along with their lineage and associated metadata. | Processing modules should periodically store intermediate results and their lineage metadata. When starting up, processing modules should check at the metadata registry if intermediate results are available.
R2: Extracting images and their captions from scientific publications. | To be developed for the pilot, taking into account R1.
R3: Extracting thematic annotations from text in scientific publications. | To be developed for the pilot, taking into account R1.
R4: Extracting researcher affiliations from the scientific publications. | To be developed for the pilot, taking into account R1.
R5: Variety identification. | To be developed for the pilot, taking into account R1.
R6: Phenologic modeling. | To be developed for the pilot, taking into account R1.
R7: Expose data and metadata in JSON through a Web API. | The data ingestion module should write JSON documents in HDFS; 4store should be accessed via a SPARQL endpoint that responds with results in JSON.
Figure 2: Architecture of the Second SC2 Pilot
3.3 Architecture
To satisfy the requirements above, the following modules will be deployed.
Storage infrastructures:
- HDFS for storing publication full-text and ingested datasets.
- A graph database for storing publication metadata (terms and named entities), affiliation metadata (connections between researchers), weather metadata and VITIS metadata.
Processing infrastructures:
- Metadata extraction: Spark or UnifiedViews3 is used to extract RDF metadata from publication full-text. These tools will react on Kafka messages. Spark and UnifiedViews will be evaluated for this task.
- PoolParty: a SKOS thesaurus4 will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g. bio terms, locations and other named entities). For this step, the SWC PoolParty Semantic Suite5 will be used. Additional enrichment of the dataset will be explored, e.g. via linking to DBpedia or other LOD sources.
- AKSTEM: the process of discovering relations and associations between organizations and people in the field of viticulture research.
- Phenologic modeling: an algorithm already developed in AK VITIS that will be adapted to work in the context of an Apache Spark application.
- Variety identification: already developed in AK VITIS, to be adapted to work in the context of an Apache Spark application.
- Extraction of images and figures, and their captions, from publication PDFs.
- Data analysis, which writes analysis results back into the infrastructure to be retrieved for visualization. Data analysis should accompany each write-back with appropriate metadata that specifies the processing lineage of the derived dataset. Intermediate results should also be written out (and described as such in the metadata) in order to allow resuming processing after a failure.
Other modules:
- Flume for publication ingestion: for every source that will be ingested into the system, there will be a Flume agent responsible for data ingestion and basic modification/unification.
- Kafka: as soon as a new record is available, a Kafka message will be produced. One Kafka consumer stores raw data into HDFS (a minimal sketch follows the list).
3 Cf. http://www.unifiedviews.eu
4 Cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System
5 Cf. http://www.poolparty.biz
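The following is a minimal sketch of that raw-data consumer, assuming the kafka-python and hdfs (hdfscli) client libraries; the topic name, the HDFS namenode URL and the path layout are illustrative assumptions.

    # Minimal sketch: each Kafka record announcing a newly ingested
    # publication is persisted under a raw-data directory in HDFS.
    import json
    from kafka import KafkaConsumer          # pip install kafka-python
    from hdfs import InsecureClient          # pip install hdfs

    consumer = KafkaConsumer(
        "publications.raw",
        bootstrap_servers="kafka:9092",
        value_deserializer=lambda b: json.loads(b.decode()))
    hdfs = InsecureClient("http://namenode:50070", user="flume")

    for msg in consumer:
        record = msg.value
        # One JSON document per record, keyed by a source-assigned id.
        path = "/raw/publications/{}.json".format(record["id"])
        with hdfs.write(path, encoding="utf-8", overwrite=True) as writer:
            json.dump(record, writer)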
3.4 Deployment
Table 4 lists the components provided to the pilot as part of BDI, and components that will be developed within WP6 in the context of executing the pilot.
Table 4: Components needed to deploy the Second SC2 Pilot
Module | Task | Responsible
Spark over HDFS, Flume, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, SWC
GraphDB and/or Neo4j dockerization | To be investigated whether the Docker images provided by the official systems6 are suitable for the pilot; if not, they will be altered for the pilot, or an already dockerized triple store such as Virtuoso or 4store will be used | SWC
Flume agents for publication ingestion and processing | To be developed for the pilot | SWC
Flume agents for data ingestion | To be extended for the pilot in order to support the introduced datasets (AccuWeather data, user-generated data) | SWC, AK
Data storage schema | To be developed for the pilot | SWC, AK
Phenologic modelling | To be adapted from AK VITIS for the pilot | AK
Spark AKSTEM | To be adapted from AK STEM for the pilot | AK
Variety identification | To be adapted from AK VITIS for the pilot | AK
6 https://neo4j.com/developer/docker
4 Second SC3 Pilot Deployment
4.1 Overview
The pilot is carried out by CRES in the frame of SC3: Secure, Clean and Efficient Energy.
The second pilot cycle extends the first pilot by adding additional online and offline data analysis, on raw data regarding Acoustic Emissions (AE) sensors and on aggregated data such as parametrics from continuous monitoring systems (CMS). The pilot demonstrates the following workflow: a developer in the field of wind energy enhances condition monitoring for each unit in a wind farm by pooling together data from multiple units from the same farm (to consider the cluster operation in total) and third-party data (to perform correlated assessment). The custom analysis modules created by the developer use both raw data that is transferred offline to the processing cluster and condensed data streamed online, in the same time order as the events occur.
The following datasets are involved:
The following datasets are involved
Raw sensor and SCADA data from a given wind farm
Online stream data comprised of parametrics and statistics extracted from the raw
SCADA data
Raw sensor data from Acoustic Emissions module from a given wind farm
All data is in custom binary or ASCII formats ASCII files contain a metadata header and in
tabulated form the signal data (signal in columns time sequence in rows) All data is annotated
by location time and system id
The following processing is carried out
Near-real time execution of parametrized models to return operational statistics
warnings including correlation analysis of data across units
Weekly execution of operational statistics
Weekly execution of model parametrization
Weekly specific acoustic emissions DSP
The following outputs are made available for visualization or further processing
Operational statistics near-real time and weekly
Model parameters
D54 ndash v 100
Page
21
4.2 Requirements
Table 5 lists the ingestion, storage, processing and output requirements set by this pilot. Since the second cycle of the pilot extends the first pilot, some requirements are identical and are therefore omitted from Table 5.
Table 5: Requirements of the Second SC3 Pilot
Requirement | Comment
R1: The online data will be sent (via OPC) from the intermediate (local) processing level to BDI. | A data connector must be developed that provides for receiving OPC streams from an OPC-compatible server.
R2: The application should be able to recover from short outages by collecting the data transmitted during the outage from the data sources. | An OPC data connector must be developed that can retrieve the missing data collected at the intermediate level from the distributed data historian systems.
R3: Near-real-time execution of parametrized models to return operational statistics, including correlation analysis of data across units. | The analysis software should write its results back in a specified format and data model that is appropriate input for further analysis.
R4: The GUI supports database querying and data visualization for the analytics results. | The GUI will be able to access files in the specified format and data model.
Figure 3: Architecture of the Second SC3 Pilot
4.3 Architecture
To satisfy the requirements above, the following modules will be deployed.
Storage infrastructures:
- HDFS, which stores binary blobs, each holding a temporal slice of the complete data. The slicing parameters are fixed and can be applied at data ingestion time.
- A Postgres relational database to store the warnings, operational statistics, and the output of the analysis. The schema will be defined at a later stage.
- A Kafka broker that will distribute the continuous stream of CMS data to model execution.
Processing infrastructures:
- A processor that operates upon temporal slices of data.
- A Spark module that orchestrates the application of the processor on slices.
- A Spark streaming module that operates on the online data (a minimal sketch follows the list).
Other modules:
- A data connector that offers an ingestion endpoint and/or can retrieve from remote data sources using the FTP protocol.
- A data connector that offers an ingestion endpoint that can retrieve an online stream using the OPC protocol and publish it to a Kafka topic.
- Data visualization that can visualize the data files stored in HDFS.
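The following is a minimal sketch of the Spark streaming module, assuming Spark Structured Streaming reading the OPC-derived Kafka topic; the topic name, the message fields and the window sizes are illustrative assumptions, not the pilot's actual configuration.

    # Minimal sketch: consume CMS parametrics published by the OPC
    # connector on Kafka and compute windowed operational statistics
    # per wind-farm unit.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import (StructType, StringType, DoubleType,
                                   TimestampType)

    spark = SparkSession.builder.appName("cms-online-stats").getOrCreate()

    schema = (StructType()
              .add("unit_id", StringType())
              .add("ts", TimestampType())
              .add("value", DoubleType()))

    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "kafka:9092")
           .option("subscribe", "cms.parametrics")
           .load())

    events = (raw.select(F.from_json(F.col("value").cast("string"),
                                     schema).alias("e"))
                 .select("e.*"))

    # Sliding 10-minute statistics per unit, tolerating 15 minutes of lag.
    stats = (events.withWatermark("ts", "15 minutes")
                   .groupBy(F.window("ts", "10 minutes", "1 minute"),
                            "unit_id")
                   .agg(F.avg("value").alias("mean"),
                        F.stddev("value").alias("std")))

    (stats.writeStream.outputMode("append")
          .format("console").start().awaitTermination())

In the pilot, the console sink would be replaced by a writer to the Postgres statistics tables described above.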
4.4 Deployment
Table 6 lists the components provided to the pilot as part of BDI, and components that will be developed within WP6 in the context of executing the pilot.
Table 6: Components needed to deploy the Second SC3 Pilot
Module | Task | Responsible
Spark, HDFS, Postgres, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, SWC
Acoustic Emissions DSP | To be developed for the pilot | CRES
OPC data connector | To be developed for the pilot | CRES
Data visualization | To be extended for the pilot | CRES
5 Second SC4 Pilot Deployment
5.1 Use cases
The pilot is carried out by FhG and CERTH in the frame of SC4: Smart, Green and Integrated Transport.
The pilot demonstrates how to implement the workflow for ingesting, processing and storing stream and historical traffic data in a distributed environment. The pilot demonstrates the following workflows:
- The map matching of the Floating Car Data (FCD) stream that is generated by the taxi fleet. The FCD data, which represents the position of cabs using latitude and longitude coordinates, must be map matched to the roads on which the cabs are driving, in order to infer the traffic conditions of the roads. The map matching is done through an algorithm using a geographical database and topological rules.
- The monitoring of the current traffic conditions, which consumes the mapped FCD data and infers the traffic conditions of the roads.
- The forecasting of future traffic conditions, based on a model that is trained from historical and real-time mapped FCD data.
The second pilot is based upon the processing modules developed in the first pilot (cf. D5.2, Section 5), namely the processing modules developed by CERTH to analyze traffic data and classify traffic conditions. The second pilot will also develop the newly added workflow of traffic forecasting and model training, which did not exist during the first pilot cycle.
The data sources available for the pilot are:
- A near-real-time stream of Floating Car Data (FCD) generated by a fleet of 1200 taxis, containing information about the position, speed and direction of the cabs.
- A historical database of recorded FCD data.
- A geographical database with information about the road network in Thessaloniki.
The results of traffic monitoring and traffic forecasting are saved into a database for querying, statistics and visualizations.
5.2 Requirements
Table 7 lists the ingestion, storage, processing and output requirements set by this pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 7 lists only the new requirements.
Table 7: Requirements of the Second SC4 Pilot
Requirement | Comment
R1: The pilot will enable the evaluation of present and future traffic conditions (e.g. congestion) within temporal windows. | The map-matched FCD data is used to determine the current traffic conditions and to make predictions within different time windows.
R2: The traffic predictions will be saved in a database. | Traffic conditions and predictions will be used for queries, statistics, evaluation of the quality of predictions, and visualizations.
R3: The pilot can be started in two configurations: single node (for development and testing) and cluster (production). | It must be possible to run all the pilot components on one single node for development and testing purposes. The cluster configuration must provide clustering of all components: messaging system (Kafka), processing modules (Flink, Spark, TensorFlow) and storage (Postgres).
Figure 4: Architecture of the Second SC4 Pilot
5.3 Architecture
The architecture of the pilot has been designed taking into consideration the data sources (mostly streams), the processing steps needed, and the information that needs to be computed. The pilot will ingest data from a near-real-time FCD data stream from cabs and from historical FCD data. The FCD data needs to be preprocessed for map matching before being used for classification/prediction.
Apache Kafka will be used to distribute the computations, as it provides a scalable, fault-tolerant messaging system. The processing of the data streams will be performed within temporal windows. Apache Flink will be used for the map matching algorithm, in the same manner as in the first cycle of the pilot. Apache Spark or TensorFlow will be considered as a platform to implement the traffic forecasting algorithm.
The algorithms used for map matching and classification will be provided using R, as it provides good support for machine learning algorithms and because it is commonly used and well known by researchers at CERTH. In order to use the R packages in a Flink application developed in Java, the pilot will connect to an R server (via Rserve). Recurrent Neural Networks will be used for the traffic forecasting module.
The traffic conditions and prediction computations will be stored in a scalable, fault-tolerant database, such as Elasticsearch. The storage system must support spatial and temporal indexing.
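The following is a minimal sketch of the historical-data side of the Kafka producer foreseen in Table 8 below, assuming the kafka-python client and an illustrative CSV layout for the recorded FCD; the topic name and the field names are assumptions.

    # Minimal sketch: replay historical FCD records from the file system
    # into the Kafka topic that feeds the map matching job.
    import csv
    import json
    import time
    from kafka import KafkaProducer  # pip install kafka-python

    producer = KafkaProducer(
        bootstrap_servers="kafka:9092",
        value_serializer=lambda d: json.dumps(d).encode())

    with open("fcd_history.csv", newline="") as f:
        # Assumed columns: taxi_id, ts, lat, lon, speed, heading.
        for row in csv.DictReader(f):
            producer.send("fcd.raw", {
                "taxi_id": row["taxi_id"],
                "ts": row["ts"],
                "lat": float(row["lat"]),
                "lon": float(row["lon"]),
                "speed": float(row["speed"]),
                "heading": float(row["heading"]),
            })
            time.sleep(0.01)  # crude pacing so the replay behaves like a stream

    producer.flush()

The same producer logic, pointed at the web-service source instead of a file, covers the near-real-time side of the connector.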
5.4 Deployment
Table 8 lists the components provided to the pilot as part of BDI, and components that will be developed within WP6 in the context of executing the pilot.
Table 8: Components needed to deploy the Second SC4 Pilot
Module | Task | Responsible
PostGIS, Elasticsearch, Kafka, Flink, Spark, TensorFlow | BDI dockers made available by WP4 | NCSR-D, SWC, TF, FhG
A Kafka producer for the FCD data stream (source: URL) and historical data (source: file system) | Develop a Kafka producer to collect the FCD data as a stream from web services, and from the file system for the historical data sets, and send them to a Kafka topic | FhG
Kafka brokers | Install Kafka to provide a message broker and the topics | SWC
A Spark application for traffic forecasting and model training | Develop a Spark application that consumes FCD matched data from a Kafka topic. The application will train a prediction model and write the traffic predictions to Elasticsearch | FhG
A Kafka consumer for storing analysis results | Develop a Kafka consumer that stores the results of the traffic classification and prediction module | FhG
6 Second SC5 Pilot Deployment
6.1 Use cases
The pilot is carried out by NCSR-D in the frame of SC5: Climate Action, Environment, Resource Efficiency and Raw Materials.
The pilot demonstrates the following workflow. A (potentially hazardous) substance is released in the atmosphere, resulting in increased readings in one or more monitoring stations. The user accesses a user interface provided by the pilot to define the locations of the monitoring stations, as well as a timeseries of the measured values (e.g. gamma dose rate). The platform initiates:
- a weather matching algorithm, that is, a search for similarity between the current weather and the pre-computed weather patterns, as well as
- a dispersion matching algorithm, that is, a search for similarity between the current substance dispersion patterns and the precomputed ones.
The weather patterns have been extracted in a pre-processing step by clustering weather conditions recorded in the past, while the substance dispersion patterns have been precomputed by simulating different scenarios of substance release and weather conditions. The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request.
The following datasets are involved:
- NetCDF files from the European Centre for Medium-Range Weather Forecasts (ECMWF7).
- GRIB files from the National Oceanic and Atmospheric Administration (NOAA8).
7 http://apps.ecmwf.int/datasets
8 https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/global-forcast-system-gfs
The following processing will be carried out:
- The weather clustering algorithm, which creates clusters of similar weather conditions, implemented using the BDI platform (see Section 6.3).
- The WRF downscaling, which takes as input a low-resolution weather field and creates a high-resolution one.
- The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model, which computes dispersion patterns given predominant weather conditions.
The following outputs are made available for visualization or further processing:
- The dispersions produced by DIPCOT.
- The weather clusters produced by the weather clustering algorithm.
6.2 Requirements
Table 9 lists the ingestion, storage, processing and output requirements set by this pilot.
Table 9: Requirements of the Second SC5 Pilot
Requirement | Comment
R1: Provide a means of downloading current/evaluation weather from ECMWF or alternative services. | A data connector/interface needs to be developed.
R2: ECMWF and NOAA datasets are compatible with the WRF and DIPCOT naming conventions. | A preprocessing WPS normalization step will perform the transformations and variable renamings needed to ensure compatibility.
R3: Retrieve NetCDF files from HDFS as input to the weather clustering algorithm.
R4: Dispersion matching will filter on dispersion values. | A relational database will provide indexes on dispersion values for efficient dispersion search.
R5: Dispersion visualization. | Weather and dispersion matching must produce output compatible with Sextant's input, or Sextant must be modified to support the new input.
Figure 5: Architecture of the Second SC5 Pilot
6.3 Architecture
To satisfy the requirements described above, the following components will be deployed.
Storage infrastructure:
- HDFS for storing NetCDF and GRIB files.
- Postgres for storing dispersions.
Processing components:
- Scikit-learn or TensorFlow to host the weather clustering algorithm (a minimal sketch follows the list).
Other modules:
- ECMWF and NOAA data connectors.
- WPS normalization procedure.
- WRF downscaling component.
- DIPCOT atmospheric dispersion model.
- Weather and dispersion matching.
- Sextant for visualizing the dispersion layer.
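The following is a minimal sketch of the weather clustering step with Scikit-learn, assuming the netCDF4 reader and illustrative variable names (u10/v10 wind components on a time x lat x lon grid); the actual features and the number of clusters are to be determined in the pilot.

    # Minimal sketch: flatten each NetCDF time slice into a feature
    # vector and group the slices into weather patterns with k-means.
    import numpy as np
    from netCDF4 import Dataset            # pip install netCDF4
    from sklearn.cluster import KMeans     # pip install scikit-learn

    nc = Dataset("weather_history.nc")
    # Assumed variables: u10/v10 wind components, shape (time, lat, lon).
    u = np.asarray(nc.variables["u10"][:])
    v = np.asarray(nc.variables["v10"][:])

    # One sample per time step: the flattened wind field.
    samples = np.concatenate(
        [u.reshape(u.shape[0], -1), v.reshape(v.shape[0], -1)], axis=1)

    kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(samples)

    # The centroids act as the pre-computed weather patterns; the label
    # assigned to a current weather field selects the matching pattern.
    np.save("weather_patterns.npy", kmeans.cluster_centers_)
    print("pattern sizes:", np.bincount(kmeans.labels_))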
6.4 Deployment
Table 10 lists the components provided to the pilot as part of BDI, and components that will be developed within WP6 in the context of executing the pilot.
Table 10: Components needed to deploy the Second SC5 Pilot
Module | Task | Responsible
HDFS, Sextant, Postgres | BDI dockers made available by WP4 | TF, UoA, NCSR-D
Scikit-learn / TensorFlow | To be developed in the pilot | NCSR-D
DIPCOT | To be packaged in the pilot | NCSR-D
Weather clustering algorithm | To be developed in the pilot | NCSR-D
Weather matching | To be developed in the pilot | NCSR-D
Dispersion matching | To be developed in the pilot | NCSR-D
ECMWF and NOAA data connector | To be developed in the pilot | NCSR-D
Data visualization UI | To be developed in the pilot | NCSR-D
7 Second SC6 Pilot Deployment
7.1 Use cases
The pilot is carried out by NCSR-D and SWC in the frame of SC6: Europe in a changing world – inclusive, innovative and reflective societies.
The pilot demonstrates the following workflow. Municipality economic data (i.e. budget and budget execution data) is ingested on a regular basis (daily, weekly, and so on) from a series of locations, in a variety of structures and formats, and is homogenized so that it can be compared, analyzed and visualized in a comprehensible way. The data is exposed to users via a dashboard that exposes search/discovery, aggregation, analysis, correlation and visualization functionalities over structured data. The results of the data analysis will be stored in the infrastructure, to avoid carrying out the same processing multiple times.
The second cycle of the pilot will extend the first pilot by incorporating different formats, through the development of a modular parsing library.
The following datasets are involved:
- Budget execution data of the Municipality of Athens.
- Budget execution data of the Municipality of Thessaloniki.
- Budget execution data of the Municipality of Barcelona.
The current datasets involved are exposed either as an API or as CSV/XML files. Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies. Statistical data will be described in the RDF Data Cube12 vocabulary.
9 Cf. https://joinup.ec.europa.eu/asset/dcat_application_profile/description
10 Cf. http://www.omg.org/spec/EDMC-FIBO/FND/1.0/Beta1/index.htm
11 Cf. http://www.omg.org/hot-topics/finance.htm
12 Cf. https://www.w3.org/TR/2014/REC-vocab-data-cube-20140116
The following processing is carried out:
- Data ingestion and homogenization.
- Aggregation, analysis and correlation over the budget data.
The following outputs are made available for visualization or further processing:
- Structured information extracted from budget datasets, exposed as a SPARQL endpoint.
- Metadata for dataset searching and discovery.
- Aggregation and analysis results.
7.2 Requirements
Table 11 lists the ingestion, storage, processing and output requirements set by this pilot.
Table 11: Requirements of the Second SC6 Pilot
Requirement | Comment
R1: In case of failures during data processing, users should be able to recover data produced earlier in the workflow, along with their lineage and associated metadata. | Processing modules should periodically store intermediate results. When starting up, processing modules should check at the metadata registry if intermediate results are available.
R2: Transform budget data into a homogenized format using various parsers. | Parsers will be developed for the pilot, taking into account R1.
R3: Expose data and metadata through a SPARQL endpoint. | The triple store should be accessed via a SPARQL endpoint.
R4: Intuitive, easy-to-use interface for searching and selecting relevant data sources. The use of the user interface should be documented, so that users can ease into using it with as little effort as possible. | The GraphSearch UI will be used to create visualizations from SPARQL queries.
Figure 6: Architecture of the Second SC6 Pilot
7.3 Architecture
To satisfy the requirements above, the following modules will be deployed.
Storage infrastructures:
- HDFS for storing ingested datasets.
- 4store for storing homogenized statistical data and dataset metadata.
Processing infrastructures:
- Metadata extraction: Spark is used to extract RDF data and metadata from budget data. These tools will react on Kafka messages.
- PoolParty: a SKOS thesaurus13 will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g. bio terms, locations and other named entities). For this step, the SWC PoolParty Semantic Suite14 will be used as an external service. PoolParty is accessible from the BDE components via an HTTP API; the connection between Spark and PoolParty has been implemented in the first pilot cycle. Additional enrichment of the dataset will be explored, e.g. via linking to DBpedia or other LOD sources.
- Data analysis, which will be performed on demand by pre-defined queries in the dashboard.
Other modules:
- Flume for dataset ingestion: for every source that will be ingested into the system, there will be a Flume agent responsible for data ingestion and basic modification/unification.
- Kafka: as soon as a new record is available, a Kafka message will be produced. One Kafka consumer stores raw data into HDFS.
- A set of pre-defined SPARQL queries that carry out analytical aggregations, important comparisons, and/or other analysis of the data (a minimal sketch follows the list).
- A GUI that provides functionality for (a) metadata searching, to discover datasets, data and publications, and (b) linked data browsing (i.e. dereferencing entity descriptions in RDF), in the form of a visual dashboard realised in d3.js15.
- GraphSearch as the user interface.
13 Please cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System
14 Please cf. http://www.poolparty.biz
15 Cf. https://d3js.org
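The following is a minimal sketch of one such pre-defined analytical query, assuming budget records modelled as RDF Data Cube observations; the endpoint URL and the ex: property names are illustrative assumptions, not the pilot's actual data model.

    # Minimal sketch: a pre-defined analytical aggregation computing the
    # total budgeted amount per year for one municipality.
    import requests

    TOTAL_PER_YEAR = """
    PREFIX qb: <http://purl.org/linked-data/cube#>
    PREFIX ex: <http://example.org/budget#>
    SELECT ?year (SUM(?amount) AS ?total) WHERE {
      ?obs a qb:Observation ;
           ex:municipality <http://example.org/municipality/Athens> ;
           ex:year ?year ;
           ex:amount ?amount .
    } GROUP BY ?year ORDER BY ?year
    """

    resp = requests.post(
        "http://4store:8080/sparql/",  # assumed endpoint location
        data={"query": TOTAL_PER_YEAR},
        headers={"Accept": "application/sparql-results+json"})
    resp.raise_for_status()
    for b in resp.json()["results"]["bindings"]:
        print(b["year"]["value"], b["total"]["value"])

Storing such queries as named, pre-defined templates lets the dashboard execute them on demand and cache the results in the infrastructure, as described above.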
7.4 Deployment
Table 12 lists the components provided to the pilot as part of BDI, and components that will be developed within WP6 in the context of executing the pilot.
Table 12: Components needed to deploy the Second SC6 Pilot
Module | Task | Responsible
Spark over HDFS, 4store, Flume, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, SWC
Data storage schema | To be extended for the pilot | SWC
Metadata extraction | Parsers for different data sources will be developed for the pilot | SWC
GraphSearch GUI | To be configured for the pilot | SWC
8 Second SC7 Pilot Deployment
8.1 Use cases
The pilot is carried out by SatCen, UoA and NCSR-D in the frame of SC7: Secure societies – protecting freedom and security of Europe and its citizens.
The pilot demonstrates the following workflows:
1. Event detection workflow: news sites and social media are monitored and processed in order to extract and localize information about events. Events are categorized and the information from them is extracted; the end user is notified about the area concerned by the news, and can visualize the event information together with the changes detected by the other workflow (if activated).
2. Change detection workflow: the end user selects a relevant Area of Interest. With respect to the selected dates, two satellite images (earliest and latest) of these areas are downloaded from the ESA Sentinels Scientific Data Hub and processed in order to detect changes. The end user is notified about detected changes, and can view the images and event information about this area.
The second cycle of the SC7 pilot will extend the functionality and improve the performance of the first cycle of the pilot (cf. D5.2, Section 8).
Apart from the datasets used in the first cycle of the pilot, this cycle will also use the keyword-based Twitter API to retrieve tweets based on pre-defined keywords. To further support the keyword-based search, the second cycle of the pilot will also include a full-text indexing engine.
The following outputs are made available for visualization or further processing:
- Relevant news related to specific keywords, together with the corresponding Area of Interest.
- Detected changes.
Moreover, the event detection workflow will be extended in order to automatically activate the change detection workflow. These changes are depicted in the updated architecture diagram in Figure 7.
8.2 Requirements
Table 13 lists the ingestion, storage, processing and output requirements set by the second cycle of the pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 13 lists only the new requirements.
Table 13: Requirements of the Second SC7 Pilot
Requirement | Comment
R1: Monitor keyword-based text services (Twitter). Text is retrieved and stored together with provenance and any metadata provided by the service (notably location). | The NOMAD data connectors to Twitter and Reuters will be adapted to access the keyword search API of Twitter and store to Cassandra.
R2: Regularly execute event detection using Spark over the most recent text batch. | Event detection is part of the ingestion process and adds annotations to the text data; it is not part of the distributed processing.
R3: Improve the speed of the change detection workflow. | Optimize the scalability of the operators developed in Apache Spark for the change detection workflow.
R4: Extend the change detection workflow to improve accuracy. | Fundamental SNAP operators (e.g. Subset and Terrain Correction) for Sentinel-1 will be adapted to Apache Spark.
R5: Areas of Interest are automatically defined by event detection. | The Sentinel data connector is parametrized from the event detection module with a GIS shape.
R6: The end-user interface is based on Sextant. | Improvement of Sextant functionalities to improve the user experience.
R7: Users must be authenticated and authorized to access the pilot data. | Sextant will be extended in order to support authentication and authorization.
Figure 7: Architecture of the Second SC7 Pilot
8.3 Architecture
To satisfy the requirements above, the following modules will be deployed.
Storage infrastructures:
- HDFS for storing satellite images.
- Cassandra for storing news and tweet content and metadata.
- Lucene for storing the GADM dataset, i.e. the administrative areas together with their geo-locations.
- Strabon for storing geo-locations of detected changes, and location metadata about news and tweets.
Processing infrastructures:
- Spark will be made available for improving the change detection module and developing the event detection module.
Data integration:
- Semagrow will federate Strabon and Cassandra to provide the user interface with homogeneous access to both data stores.
Other modules:
- Twitter data connector (a minimal sketch follows the list).
- Reuters RSS feed reader.
- The Sentinel Data Aggregator, which receives as input the set of Areas of Interest and submits a suitable query to the Sentinels Scientific Data Hub.
- Sextant as the user interface.
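The following is a minimal sketch of the keyword-based Twitter connector, assuming the Twitter v1.1 search endpoint and the DataStax cassandra-driver; the bearer token, keyspace, table layout and keyword list are illustrative assumptions, not the pilot's actual configuration.

    # Minimal sketch: poll the keyword search API and store each tweet,
    # with its provenance metadata, in Cassandra.
    import requests
    from cassandra.cluster import Cluster  # pip install cassandra-driver

    SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"
    TOKEN = "..."  # bearer token obtained out of band

    session = Cluster(["cassandra"]).connect("sc7")
    insert = session.prepare(
        "INSERT INTO tweets (id, keyword, created_at, text, location)"
        " VALUES (?, ?, ?, ?, ?)")

    def poll(keyword):
        resp = requests.get(
            SEARCH_URL,
            params={"q": keyword, "result_type": "recent"},
            headers={"Authorization": "Bearer " + TOKEN})
        resp.raise_for_status()
        for tweet in resp.json().get("statuses", []):
            coords = tweet.get("coordinates")  # location, when provided
            session.execute(insert, (tweet["id"], keyword,
                                     tweet["created_at"], tweet["text"],
                                     str(coords) if coords else None))

    for kw in ("flood", "earthquake", "wildfire"):  # pre-defined keywords
        poll(kw)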
8.4 Deployment
Table 14 lists the components provided to the pilot as part of BDI16 and components that will be developed within WP6 in the context of executing the pilot.
16 Cf. https://github.com/big-data-europe/README/wiki/Components
Table 14: Components needed to deploy the Second SC7 Pilot
Module | Task | Responsible
Big Data Integrator: HDFS/Hadoop, Cassandra, Spark, Semagrow, Strabon, SOLR | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, UoA, SWC
Cassandra and Strabon stores | The schema needs to be altered to support tweets by keyword | NCSR-D and UoA
Change detection module | Spark code to be developed for extending and improving the change detection algorithm | UoA
Event detection module | Spark code to be developed to scale the event detection algorithm | NCSR-D
Twitter data connector | To be extended to access the keyword search Twitter API | NCSR-D
User interface | To be enhanced for the pilot | UoA
D54 ndash v 100
Page
41
9 Conclusions
This report analysed the pilot requirements and specified the components of the generic Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting round. The relevant work in this task is to ensure that the components are within the scope of what is prepared in WP4, and that they interoperate and can be used in the same application.
All seven BDI instantiations have been deployed and tested on the NCSR-D infrastructure and provided to the piloting partners as a basis for their piloting applications, which will be developed in WP6. As a result of this preliminary testing and of the interaction between the technical partners and the piloting partners, some of the original pilot descriptions have been refined and fully specified, and their usage of BDI components has been clarified. This ensures that the pilot descriptions are consistent with the first public release of the BDI platform (D4.2) and can be reproduced by interested third parties.
Work in this task (Task 5.2) will proceed as follows:
- During the second pilot deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.
- During the third pilot deployment phase, work in this task will prepare the next version of this document, describing the BDI instances needed for the third piloting round.
D54 ndash v 100
Page
12
Figure 1 Architecture of the Second SC1 Pilot
Figure 1 Architecture of the Second SC1 pilot
23 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
Distributed triple store for the data The second pilot cycle will also test the feasibility of
using SANSA stack1 as an alternative of SPARQL query processing
Processing infrastructures
Scientific Lenses query expansion
Other modules
Data connector including the data transformation modules for the alignment of data at
ingestion time
REST API for querying that builds a SPARQL query by using keywords to fill in pre-
defined query templates The querying services also uses Scientific Lenses to expand
queries
24 Deployment
Table 2 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
1 httpsansa-stacknet
D54 ndash v 100
Page
13
Table 2 Components needed to Deploy Second SC1 Pilot
Module Task Responsible
4store BDI dockers made available by WP4 NCSR-D
SANSA stack BDI dockers made available by WP4 FhGUniBonn
Data connector and
transformation modules
Develop a dynamic transformation
engine that uses SWAGGER
descriptions to select the appropriate
transformer
VU
Query endpoint Develop a dynamic query re-write
engine that uses SWAGGER
descriptions to select the transformer
VU
Scientific Lenses query
expansion module
Needs to be deployed and tested
unless an existing live service will be
used for the BDE pilot
VU
Table 2 Components needed to Deploy Second SC1 Pilot
D54 ndash v 100
Page
14
3 Second SC2 Pilot Deployment
31 Overview
The pilot is carried out by AK FAO and SWC in the frame of SC2 Food Security Sustainable
Agriculture and Forestry Marine Maritime and Inland Water Research and the Bioeconomy
The second pilot cycle builds upon the first pilot cycle (cf D51 Section 3) expanding the
relevant data sources and extending the data processing needed to handle a variety of data
types (apart from bibliographic data) relevant to Viticulture
The pilot demonstrates the following workflows
1 Text mining workflow Automatically annotating scientific publications by (a) extracting
named entities (locations domain terms) and (b) extracting the captions of images
figures and tables The extracted information is provided to viticultural researchers via
a GUI that exposes search functionality
2 Data processing workflow The end users (viticultural researchers) upload scientific
data in a variety of formats and provide the metadata needed in order to correctly
interpret the data The data is ingested and homogenized so that it can be compared
and connected with other relevant data originally in diverse formats The data is
exposed to viticultural researchers via a GUI that exposes searchdiscovery
aggregation analysis correlation and visualization functionalities over structured data
The results of the data analysis will be stored in the infrastructure to avoid carrying out
the same processing multiple times with appropriate provence for future reference
publication and scientific replication
3 Phenologic modeling workflow that is the scheduling of agricultural operations (eg
pruning harvesting etc) by cross-examining the weather data observed in the area of
the vineyard with the appropriate weather conditions needed for the aforementioned
operations
4 Variety identification workflow The end users complete an on-spot questionnaire
regarding the characteristics of a specific grape variety Together with the geolocation
of the questionnaire this information is used to identify a grape variety
The following datasets will be involved
The AGRIS and PubMed datasets that include scientific publications
Weather data available via publicly-available API such as AccuWeather
OpenWeatherMap Weather Underground
D54 ndash v 100
Page
15
User-generated data such as geotagged photos from leaves young shoots and grape
clusters ampelographic data SSR-marker data that will be provided by the VITIS
application
OIV Descriptor List2 for Grape Varieties and Vitis species
Crop Ontology
The following processing is carried out
Named entity extraction
Researcher affiliation extraction and verification
Variety identification
Phenologic modelling
PDF structure processing to associate tables and diagrams with captions
Data ingestion and homogenization
Aggregation analysis correlation over scientific data
The following outputs are made available for visualization or further processing
Structured information topics extracted from scientific publications
Metadata for dataset searching and discovery
Aggregation analysis correlation results
32 Requirements
Table 3 lists the ingestion storage processing and output requirements set by this pilot
Table 3 Requirements of the Second SC2 Pilot
Requirement Comment
R1 In case of failures during data
processing users should be able to
recover data produced earlier in the
workflow along with their lineage and
associated metadata
Processing modules should periodically
store intermediate results and their lineage
metadata When starting up processing
modules should check at the metadata
registry if intermediate results are available
R2 Extracting images and their captions
from scientific publications
To be developed for the pilot taking into
account R1
2 httpwwwoivinten
D54 ndash v 100
Page
16
R3 Extracting thematic annotations from
text in scientific publications
To be developed for the pilot taking into
account R1
R4 Extracting researcher affiliations from
the scientific publications
To be developed for the pilot taking into
account R1
R5 Variety identification To be developed for the pilot taking into
account R1
R6 Phenolic modeling To be developed for the pilot taking into
account R1
R5 Expose data and metadata in JSON
through a Web API
Data ingestion module should write JSON
documents in HDFS 4store should be
accessed via a SPARQL endpoint that
responds with results in JSON
Table 3 Requirements of the Second SC2 Pilot
D54 ndash v 100
Page
17
Figure 2 Architecture of the Second SC2 Pilot
Figure 2 Architecture of the Second SC2 Pilot
33 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS for storing publication full-text and ingested datasets
A graph database for storing publication metadata (terms and named entities)
affiliation metadata (connections between researchers) weather metadata and VITIS
metadata
Processing infrastructures
Metadata extraction Spark or UnifiedViews3 are used to extract RDF metadata from
publication full-text These tools will react on Kafka messages Spark and UnifiedViews
will be evaluated for this task
3 Cf httpwwwunifiedviewseu
D54 ndash v 100
Page
18
PoolParty A SKOS Thesaurus4 will be used to consolidatetranslate (linkmap) the
terms in the ingested documents (eg bio terms locations and other named entities)
For this step the SWC PoolParty Semantic Suite5 will be used Additional enrichment
of the dataset will be explored eg via linking to DBpedia or other LOD sources
AKSTEM the process of discovering relations and associations between organizations
and people in the field of viticulture research
Phenolic Modeling algorithm already developed in AK VITIS will be adapted to work in
the context of an Apache Spark application
Variety Identification already developed in AK VITIS will be adapted to work in the
context of an Apache Spark application
Extraction of images and figures and their captions from publication PDFs
Data analysis which writes analysis results back into the infrastructure to be retrieved
for visualization Data analysis should accompany each write-back with appropriate
metadata that specify the processing lineage of the derived dataset Intermediate
results should also be written out (and described as such in the metadata) in order to
allow resuming processing after a failure
Other modules
Flume for publication ingestion For every source that will be ingested into the system
there will be a flume agent responsible for data ingestion and basic
modificationunification
Kafka as soon as a new record is available a Kafka message will be produced One
kafka consumer stores raw data into HDFS
34 Deployment
Table 4 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 4 Components needed to deploy the Second SC2 Pilot
Module Task Responsible
Spark over HDFS Flume
Kafka
BDI dockers made available by WP4 FH TF InfAI
SWC
4 Cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 5 Cf httpwwwpoolpartybiz
D54 ndash v 100
Page
19
GraphDB andor Neo4j
dockerization
To be investigated if the Docker
images provided by the official
systems6 are suitable for the pilot If
not will be altered for the pilot or use
an already dockerized triple store such
as Virtuoso or 4store
SWC
Flume agents for publication
ingestion and processing
To be developed for the pilot SWC
Flume agents for data
ingestion
To be extended for the pilot in order to
support the introduced datasets
(accuweather data user-generated
data)
SWC AK
Data storage schema To be developed for the pilot SWC AK
Phenolic modelling To be adapted from AK VITIS for the
pilot
AK
Spark AKSTEM To be adapted from AK STEM for the
pilot
AK
Variety Identification To be adapted from AK VITIS for the
pilot
AK
Table 4 Components needed to deploy the Second SC2 Pilot
6 httpsneo4jcomdeveloperdocker
D54 ndash v 100
Page
20
4 Second SC3 Pilot Deployment
41 Overview
The pilot is carried out by CRES in the frame of SC3 Secure Clean and Efficient Energy
The second pilot cycle extends the first pilot by adding additional online and offline data
analysis on raw data regarding Acoustic Emissions (AE) sensors and aggregated data such
as parametrics from continuous monitoring systems (CMS) The pilot demonstrates the
following workflow a developer in the field of wind energy enhances condition monitoring for
each unit in a wind farm by pooling together data from multiple units from the same farm (to
consider the cluster operation in total) and third party data (to perform correlated assessment)
The custom analysis modules created by the developer use both raw data that are transferred
offline to the processing cluster and condensed data streamed online at the same time order
that the event occurs
The following datasets are involved
Raw sensor and SCADA data from a given wind farm
Online stream data comprised of parametrics and statistics extracted from the raw
SCADA data
Raw sensor data from Acoustic Emissions module from a given wind farm
All data is in custom binary or ASCII formats ASCII files contain a metadata header and in
tabulated form the signal data (signal in columns time sequence in rows) All data is annotated
by location time and system id
The following processing is carried out
Near-real time execution of parametrized models to return operational statistics
warnings including correlation analysis of data across units
Weekly execution of operational statistics
Weekly execution of model parametrization
Weekly specific acoustic emissions DSP
The following outputs are made available for visualization or further processing
Operational statistics near-real time and weekly
Model parameters
D54 ndash v 100
Page
21
42 Requirements
Table 5 lists the ingestion storage processing and output requirements set by this pilot Since
the second cycle of the pilot extends the first pilot some requirements are identical and
therefore omitted from Table 5
Table 5 Requirements of Second SC3 Pilot
Requirement Comment
R1 The online data will be sent (via
OPC) from the intermediate
(local) processing level to BDI
A data connector must be developed that provides
for receiving OPC streams from an OPC-
compatible server
R2 The application should be able
to recover from short outages by
collecting the data transmitted
during the outage from the data
sources
An OPC data connector must be developed that
can retrieve the missing data collected at the
intermediate level from the distributed data
historian systems
R3 Near-realtime execution of
parametrized models to return
operational statistics including
correlation analysis of data
across units
The analysis software should write its results back
into a specified format and data model that is
appropriate input for further analysis
R4 The GUI supports database
querying and data visualization
for the analytics results
The GUI will be able to access files in the format
and data model
Table 5 Requirements of the Second SC3 Pilot
D54 ndash v 100
Page
22
Figure 3 Architecture of the Second SC3 Pilot
Figure 3 Architecture of the Second SC3 Pilot
43 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS that stores binary blobs each holding a temporal slice of the complete data The
slicing parameters are fixed and can be applied at data ingestion time
A Postgres relational database to store the warnings operational statistics and the
output of the analysis The schema will be defined in a later
A Kafka broker that will distribute the continuous stream of CMS to model execution
Processing infrastructures
D54 ndash v 100
Page
23
A processor that operates upon temporal slices of data
A Spark module that orchestrates the application of the processor on slices
A Spark streaming module that operates on the online data
Other modules
A data connector that offers an ingestion endpoint andor can retrieve from remote data
sources using the FTP protocol
A data connector that offers an ingestion endpoint that can retrieve an online stream
using OPC protocol and publish it to a Kafka topic
Data visualization that can visualize the data files stored in HDFS
44 Deployment
Table 6 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 6 Components needed to deploy the Second SC3 Pilot
Module Task Responsible
Spark HDFS Postgres
Kafka
BDI dockers made available by WP4 FH TF InfAI
NCSR-D SWC
Acoustic Emissions DSP To be developed for the pilot CRES
OPC Data connector To be developed for the pilot CRES
Data visualization To be extended for the pilot CRES
Table 6 Components needed to deploy the Second SC3 Pilot
D54 ndash v 100
Page
24
5 Second SC4 Pilot Deployment
51 Use cases
The pilot is carried out by FhG and CERTH in the frame of SC4 Smart Green and Integrated
Transport
The pilot demonstrates how to implement the workflow for ingesting processing and storing
stream and historical traffic data in a distributed environment The pilot demonstrates the
following workflows
The map matching of the Floating Car Data (FCD) stream that is generated by the taxi
fleet The FCD data that represents the position of cabs using latitude and longitude
coordinates must be map matched to the roads on which the cabs are driving in order
to infer the traffic conditions of the roads The map matching is done through an
algorithm using a geographical database and topological rules
The monitoring of the current traffic conditions that consumes the mapped FCD data
and infers the traffic conditions of the roads
The forecasting of future traffic conditions based on a model that is trained from
historical and real-time mapped FCD data
The second pilot is based upon the processing modules developed in the first pilot (cf D52
Section 5) namely the processing modules developed by CERTH to analyze traffic data
classify traffic conditions The second pilot will also develop the newly added workflow of the
traffic forecasting and model training that did not exist during the first pilot cycle
The data sources available for the pilot are
A near-real time stream Floating Car Data (FCD) generated by a fleet of 1200 taxis
containing information about the position speed and direction of the cabs
A historical database of recorded FCD data
A geographical database with information about the road network in Thessaloniki
The results of traffic monitoring and traffic forecasting are saved into a database for querying
statistics and visualizations
52 Requirements
Table 7 lists the ingestion storage processing and output requirements set by this pilot Since
the present pilot cycle is an extension of the first pilot the requirements of the first pilot also
apply Table 13 lists only the new requirements
D54 ndash v 100
Page
25
Table 7 Requirements of the Second SC4 Pilot
Requirement Comment
R1 The pilot will enable the
evaluation of the present and
future traffic conditions (eg
congestion) within temporal
windows
The FCD map matched data are used to determine
the current traffic condition and to make predictions
within different time windows
R2 The traffic predictions will be
saved in a database
Traffic condition and prediction will be used for
queries statistics evaluation of the quality of
predictions visualizations
R3 The pilot can be started in two
configurations single node (for
development and testing) and
cluster (production)
It must be possible to run all the pilot components
in one single node for development and testing
purposes The cluster configuration must provide
cluster of any components messaging system
(Kafka) processing modules (Flink Spark
TensorFlow) storage (Postgres)
Table 7 Requirements of the Second SC4 Pilot
D54 ndash v 100
Page
26
Figure 4 Architecture of the Second SC4 Pilot
Figure 4 Architecture of the Second SC4 Pilot
5.3 Architecture

The architecture of the pilot has been designed taking into consideration the data sources (mostly streams), the processing steps needed, and the information that needs to be computed. The pilot will ingest data from a near-real-time FCD data stream from cabs and from historical FCD data. The FCD data needs to be preprocessed for map matching before being used for classification/prediction.

Apache Kafka will be used to distribute the computations, as it provides a scalable, fault-tolerant messaging system. The processing of the data streams will be performed within temporal windows. Apache Flink will be used for the map matching algorithm, in the same manner as in the first cycle of the pilot. Apache Spark or TensorFlow will be considered as a platform to implement the traffic forecasting algorithm.
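As an illustration of the window-based stream processing, the following sketch consumes map-matched FCD records from a Kafka topic (using the kafka-python client rather than the pilot's Flink job) and aggregates speeds in fixed five-minute windows; the topic name, field names and window size are assumptions:

```python
# Sketch: consume map-matched FCD records from Kafka and aggregate them
# in fixed 5-minute windows. Topic, field names and window size are
# illustrative, not the pilot's actual configuration.
import json
from collections import defaultdict
from kafka import KafkaConsumer

WINDOW = 5 * 60  # window length in seconds

consumer = KafkaConsumer(
    "fcd-matched",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

windows = defaultdict(list)  # (road_id, window_start) -> observed speeds
for msg in consumer:
    record = msg.value  # e.g. {"road_id": ..., "ts": ..., "speed": ...}
    window_start = record["ts"] - record["ts"] % WINDOW
    key = (record["road_id"], window_start)
    windows[key].append(record["speed"])
    # the mean speed per road and window approximates the traffic condition
    print(key, sum(windows[key]) / len(windows[key]))
```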
The algorithms used for the map matching and classification will be provided using R, as it offers good support for machine learning algorithms and because it is commonly used and well known by researchers at CERTH. In order to use the R packages in a Flink application developed in Java, the pilot will connect to an R server (via Rserve). Recurrent Neural Networks will be used for the traffic forecasting module.
The traffic conditions and prediction computations will be stored in a scalable, fault-tolerant database such as Elasticsearch. The storage system must support spatial and temporal indexing.
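For example, a suitable Elasticsearch mapping can declare geo_point and date fields, so that predictions can be filtered both spatially and temporally. The sketch below uses the Elasticsearch 5.x-era Python client; the index and field names are illustrative:

```python
# Sketch: an Elasticsearch index for traffic predictions with spatial
# (geo_point) and temporal (date) indexing. Index and field names are
# assumptions; the doc_type style matches the Elasticsearch 5.x API.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

es.indices.create(index="traffic", body={
    "mappings": {
        "prediction": {
            "properties": {
                "road_id":    {"type": "keyword"},
                "location":   {"type": "geo_point"},
                "timestamp":  {"type": "date"},
                "congestion": {"type": "float"},
            }
        }
    }
})

es.index(index="traffic", doc_type="prediction", body={
    "road_id": "egnatia_odos",
    "location": {"lat": 40.629, "lon": 22.945},
    "timestamp": "2017-02-24T12:00:00",
    "congestion": 0.73,
})
```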
5.4 Deployment

Table 8 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.

Table 8: Components needed to deploy the Second SC4 Pilot

Module | Task | Responsible
PostGIS, Elasticsearch, Kafka, Flink, Spark, TensorFlow | BDI dockers made available by WP4 | NCSR-D, SWC, TF, FhG
A Kafka producer for the FCD data stream (source: URL) and historical data (source: file system) | Develop a Kafka producer to collect the FCD data as a stream from web services and from the file system for the historical data sets, and send them to a Kafka topic | FhG
Kafka brokers | Install Kafka to provide a message broker and the topics | SWC
A Spark application for traffic forecasting and model training | Develop a Spark application that consumes map-matched FCD data from a Kafka topic; the application will train a prediction model and write the traffic predictions to Elasticsearch | FhG
A Kafka consumer for storing analysis results | Develop a Kafka consumer that stores the results of the traffic classification and prediction module | FhG
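A minimal sketch of such a producer, assuming a hypothetical FCD web service returning JSON and using the kafka-python client:

```python
# Sketch: a Kafka producer that polls an FCD web service and forwards
# each record to a Kafka topic. The service URL, topic name and polling
# interval are assumptions for illustration.
import json
import time
import requests
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    resp = requests.get("http://example.org/fcd/latest")  # hypothetical URL
    for record in resp.json():
        producer.send("fcd-raw", record)
    producer.flush()
    time.sleep(30)  # poll the stream endpoint every 30 seconds
```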
6 Second SC5 Pilot Deployment

6.1 Use cases

The pilot is carried out by NCSR-D in the frame of SC5: Climate Action, Environment, Resource Efficiency and Raw Materials.

The pilot demonstrates the following workflow: a (potentially hazardous) substance is released in the atmosphere, resulting in increased readings at one or more monitoring stations. The user accesses a user interface provided by the pilot to define the locations of the monitoring stations, as well as a timeseries of the measured values (e.g. gamma dose rate). The platform initiates:

- a weather matching algorithm, that is, a search for similarity between the current weather and the pre-computed weather patterns, as well as
- a dispersion matching algorithm, that is, a search for similarity of the current substance dispersion patterns with the precomputed ones.

The weather patterns have been extracted in a pre-processing step by clustering weather conditions recorded in the past, while the substance dispersion patterns have been precomputed by simulating different scenarios of substance release and weather conditions. The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request.
The following datasets are involved:

- NetCDF files from the European Centre for Medium-range Weather Forecasting (ECMWF) [7]
- GRIB files from the National Oceanic and Atmospheric Administration (NOAA) [8]

The following processing will be carried out:

- The weather clustering algorithm, which creates clusters of similar weather conditions, implemented using the BDI platform (see Section 6.3)
- The WRF downscaling, which takes as input a low-resolution weather field and creates a high-resolution one
- The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model, which computes dispersion patterns given predominant weather conditions

The following outputs are made available for visualization or further processing:

- The dispersions produced by DIPCOT
- The weather clusters produced by the weather clustering algorithm

[7] http://apps.ecmwf.int/datasets/
[8] https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/global-forcast-system-gfs
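By way of illustration, the weather clustering could be hosted on Scikit-learn roughly as sketched below, where each NetCDF time step is flattened into a feature vector and clustered with k-means; the input file name, the variable name and the number of clusters are assumptions:

```python
# Sketch: cluster historical weather fields with k-means. Each NetCDF
# time step becomes one feature vector. The file name, variable name
# ('t2m') and k=8 are illustrative assumptions.
import numpy as np
from netCDF4 import Dataset
from sklearn.cluster import KMeans

nc = Dataset("era_interim_sample.nc")   # hypothetical input file
field = nc.variables["t2m"][:]          # shape: (time, lat, lon)
X = field.reshape(field.shape[0], -1)   # one vector per time step

kmeans = KMeans(n_clusters=8, random_state=0).fit(X)
print(kmeans.labels_[:10])              # weather pattern per time step

# Weather matching then reduces to a nearest-centroid lookup for the
# current weather, which is what makes precomputing clusters worthwhile.
current = X[-1].reshape(1, -1)
print("closest pattern:", kmeans.predict(current)[0])
```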
6.2 Requirements

Table 9 lists the ingestion, storage, processing and output requirements set by this pilot.

Table 9: Requirements of the Second SC5 Pilot

R1: Provide a means of downloading current/evaluation weather from ECMWF or alternative services.
Comment: A data connector/interface needs to be developed.

R2: ECMWF and NOAA datasets are compatible with the WRF and DIPCOT naming conventions.
Comment: A preprocessing WPS normalization step will perform the transformations and variable renamings needed to ensure compatibility.

R3: Retrieve NetCDF files from HDFS as input to the weather clustering algorithm.

R4: Dispersion matching will filter on dispersion values.
Comment: A relational database will provide indexes on dispersion values for efficient dispersion search.

R5: Dispersion visualization.
Comment: Weather and dispersion matching must produce output compatible with Sextant's input, or Sextant must be modified to support the new input.
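Requirement R4 can be illustrated with an indexed Postgres table; the sketch below uses psycopg2 with an assumed table and column naming:

```python
# Sketch: filter precomputed dispersions by value range using a Postgres
# index. Table and column names are assumptions for illustration.
import psycopg2

conn = psycopg2.connect("dbname=dispersions user=bde")
cur = conn.cursor()

# an index on the filtered column makes range queries efficient
cur.execute("CREATE INDEX IF NOT EXISTS idx_disp_value "
            "ON dispersion_patterns (max_concentration)")

cur.execute(
    "SELECT pattern_id, max_concentration "
    "FROM dispersion_patterns "
    "WHERE max_concentration BETWEEN %s AND %s",
    (0.5, 2.0),
)
for pattern_id, value in cur.fetchall():
    print(pattern_id, value)
conn.commit()
```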
Figure 5: Architecture of the Second SC5 Pilot
6.3 Architecture

To satisfy the requirements described above, the following components will be deployed:

Storage infrastructure:
- HDFS for storing NetCDF and GRIB files
- Postgres for storing dispersions

Processing components:
- Scikit-learn or TensorFlow to host the weather clustering algorithm

Other modules:
- ECMWF and NOAA data connectors
- WPS normalization procedure
- WRF downscaling component
- DIPCOT atmospheric dispersion model
- Weather and dispersion matching
- Sextant for visualizing the dispersion layer
6.4 Deployment

Table 10 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.

Table 10: Components needed to deploy the Second SC5 Pilot

Module | Task | Responsible
HDFS, Sextant, Postgres | BDI dockers made available by WP4 | TF, UoA, NCSR-D
Scikit-learn, TensorFlow | To be developed in the pilot | NCSR-D
DIPCOT | To be packaged in the pilot | NCSR-D
Weather clustering algorithm | To be developed in the pilot | NCSR-D
Weather matching | To be developed in the pilot | NCSR-D
Dispersion matching | To be developed in the pilot | NCSR-D
ECMWF and NOAA data connector | To be developed in the pilot | NCSR-D
Data visualization UI | To be developed in the pilot | NCSR-D
7 Second SC6 Pilot Deployment

7.1 Use cases

The pilot is carried out by NCSR-D and SWC in the frame of SC6: Europe in a changing world - inclusive, innovative and reflective societies.

The pilot demonstrates the following workflow: municipality economic data (i.e. budget and budget execution data) are ingested on a regular basis (daily, weekly, and so on) from a series of locations, in a variety of structures and formats, and are homogenized so that they can be compared, analyzed and visualized in a comprehensible way. The data is exposed to users via a dashboard that offers search/discovery, aggregation, analysis, correlation and visualization functionalities over structured data. The results of the data analysis will be stored in the infrastructure to avoid carrying out the same processing multiple times.

The second cycle of the pilot will extend the first pilot by incorporating different formats through the development of a modular parsing library.
The following datasets are involved:

- Budget execution data of the Municipality of Athens
- Budget execution data of the Municipality of Thessaloniki
- Budget execution data of the Municipality of Barcelona

The current datasets involved are exposed either as an API or as CSV/XML files. Datasets will be described by DCAT-AP [9] metadata and the FIBO [10] and FIGI [11] ontologies. Statistical data will be described in the RDF Data Cube [12] vocabulary.
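For illustration, a single budget execution figure could be expressed as an RDF Data Cube observation as sketched below with rdflib; the example namespace and properties are hypothetical, not the pilot's actual vocabulary bindings:

```python
# Sketch: one budget execution figure as an RDF Data Cube observation.
# The example namespace and property names are hypothetical.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

QB = Namespace("http://purl.org/linked-data/cube#")
EX = Namespace("http://example.org/budget/")  # hypothetical namespace

g = Graph()
obs = EX["obs/athens-2016-roads"]
g.add((obs, RDF.type, QB.Observation))
g.add((obs, QB.dataSet, EX["dataset/athens-budget-execution"]))
g.add((obs, EX.municipality, Literal("Athens")))
g.add((obs, EX.fiscalYear, Literal("2016", datatype=XSD.gYear)))
g.add((obs, EX.amount, Literal("1250000.00", datatype=XSD.decimal)))

print(g.serialize(format="turtle"))
```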
The following processing is carried out:

- Data ingestion and homogenization
- Aggregation, analysis and correlation over the ingested data

The following outputs are made available for visualization or further processing:

- Structured information extracted from budget datasets, exposed as a SPARQL endpoint
- Metadata for dataset searching and discovery
- Aggregation and analysis results

[9] Cf. https://joinup.ec.europa.eu/asset/dcat_application_profile/description
[10] Cf. http://www.omg.org/spec/EDMC-FIBO/FND/1.0/Beta1/index.htm
[11] Cf. http://www.omg.org/hot-topics/finance.htm
[12] Cf. https://www.w3.org/TR/2014/REC-vocab-data-cube-20140116/
7.2 Requirements

Table 11 lists the ingestion, storage, processing and output requirements set by this pilot.

Table 11: Requirements of the Second SC6 Pilot

R1: In case of failures during data processing, users should be able to recover data produced earlier in the workflow, along with their lineage and associated metadata.
Comment: Processing modules should periodically store intermediate results. When starting up, processing modules should check at the metadata registry whether intermediate results are available.

R2: Transform budget data into a homogenized format using various parsers.
Comment: Parsers will be developed for the pilot, taking into account R1.

R3: Expose data and metadata through a SPARQL endpoint.
Comment: The triple store should be accessed via a SPARQL endpoint.

R4: Intuitive, easy-to-use interface for searching and selecting relevant data sources. The use of the user interface should be documented so that users can ease into using it with as little effort as possible.
Comment: The GraphSearch UI will be used to create visualizations from SPARQL queries.
Figure 6: Architecture of the Second SC6 Pilot
7.3 Architecture

To satisfy the requirements above, the following modules will be deployed:

Storage infrastructures:
- HDFS for storing ingested datasets
- 4store for storing homogenized statistical data and dataset metadata

Processing infrastructures:
- Metadata extraction: Spark is used to extract RDF data and metadata from the budget data. These tools will react on Kafka messages.
- PoolParty: a SKOS Thesaurus [13] will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g. locations and other named entities). For this step the SWC PoolParty Semantic Suite [14] will be used as an external service. PoolParty is accessible from the BDE components via an HTTP API; the connection between Spark and PoolParty has been implemented in the first pilot cycle. Additional enrichment of the dataset will be explored, e.g. via linking to DBpedia or other LOD sources.
- Data analysis, performed on demand by pre-defined queries in the dashboard.

Other modules:
- Flume for dataset ingestion: for every source that will be ingested into the system, there will be a Flume agent responsible for data ingestion and basic modification/unification.
- Kafka: as soon as a new record is available, a Kafka message is produced. One Kafka consumer stores the raw data into HDFS.
- A set of pre-defined SPARQL queries that carry out analytical aggregations, important comparisons and/or other analysis of the data (a sketch of one such query is given after this list).
- A GUI that provides functionality for (a) metadata searching, to discover datasets, data and publications, and (b) linked data browsing (i.e. dereferencing entity descriptions in RDF), in the form of a visual dashboard realised in d3.js [15].
- GraphSearch as the user interface.

[13] Cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System
[14] Cf. http://www.poolparty.biz
[15] Cf. https://d3js.org
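One such pre-defined query might look like the following sketch, issued with SPARQLWrapper; the endpoint URL and the vocabulary URIs are illustrative assumptions:

```python
# Sketch: a pre-defined aggregation query summing budget amounts per
# municipality. Endpoint URL and vocabulary URIs are assumptions.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://localhost:8080/sparql/")  # 4store endpoint
sparql.setQuery("""
    PREFIX qb: <http://purl.org/linked-data/cube#>
    PREFIX ex: <http://example.org/budget/>
    SELECT ?municipality (SUM(?amount) AS ?total)
    WHERE {
        ?obs a qb:Observation ;
             ex:municipality ?municipality ;
             ex:amount ?amount .
    }
    GROUP BY ?municipality
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["municipality"]["value"], row["total"]["value"])
```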
7.4 Deployment

Table 12 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.

Table 12: Components needed to deploy the Second SC6 Pilot

Module | Task | Responsible
Spark over HDFS, 4store, Flume, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, SWC
Data storage schema | To be extended for the pilot | SWC
Metadata extraction | Parsers for different data sources will be developed for the pilot | SWC
GraphSearch GUI | To be configured for the pilot | SWC
8 Second SC7 Pilot Deployment

8.1 Use cases

The pilot is carried out by SatCen, UoA and NCSR-D in the frame of SC7: Secure societies – Protecting freedom and security of Europe and its citizens.

The pilot demonstrates the following workflows:

1. Event detection workflow: News sites and social media are monitored and processed in order to extract and localize information about events. Events are categorized and the information in them is extracted; the end-user is notified about the area concerned by the news and can visualize the event information, together with the changes detected by the other workflow (if activated).
2. Change detection workflow: The end user selects a relevant Area of Interest. With respect to the selected dates, two satellite images (earliest and latest) of these areas are downloaded from the ESA Sentinels Scientific Data Hub and processed in order to detect changes. The end-user is notified about detected changes and can view the images and event information about this area.

The second cycle of the SC7 pilot will extend the functionality and improve the performance of the first cycle of the pilot (cf. D5.2 Section 8).

Apart from the datasets used in the first cycle of the pilot, this cycle will also use the keyword-based Twitter API to retrieve tweets based on pre-defined keywords (a sketch of this retrieval is given below). To further support the keyword-based search, the second cycle of the pilot will also include a full-text indexing engine.
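A minimal sketch of the keyword-based retrieval and storage, using the tweepy client and the DataStax Cassandra driver; the credentials, keywords and table schema are placeholders rather than the pilot's actual NOMAD connector:

```python
# Sketch: retrieve tweets for pre-defined keywords via the Twitter
# search API (tweepy 3.x-era API) and store them in Cassandra together
# with provenance. Credentials, keywords and schema are placeholders.
import tweepy
from cassandra.cluster import Cluster

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

session = Cluster(["127.0.0.1"]).connect("sc7")  # hypothetical keyspace
insert = session.prepare(
    "INSERT INTO tweets (id, keyword, text, created_at, geo) "
    "VALUES (?, ?, ?, ?, ?)")

for keyword in ["flood", "wildfire"]:            # pre-defined keywords
    for tweet in api.search(q=keyword, count=100):
        session.execute(insert, (
            tweet.id, keyword, tweet.text, tweet.created_at,
            str(tweet.coordinates),              # location metadata, if any
        ))
```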
The following outputs are made available for visualization or further processing:

- Relevant news related to specific keywords, together with the corresponding Area of Interest
- Detected changes

Moreover, the event detection workflow will be extended in order to automatically activate the change detection workflow. These changes are depicted in the updated architecture diagram in Figure 7.
8.2 Requirements

Table 13 lists the ingestion, storage, processing and output requirements set by the second cycle of the pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 13 lists only the new requirements.

Table 13: Requirements of the Second SC7 Pilot

R1: Monitor keyword-based text services (Twitter). Text is retrieved and stored together with provenance and any metadata provided by the service (notably location).
Comment: The NOMAD data connectors to Twitter and Reuters will be adapted to access the keyword search API of Twitter and to store to Cassandra.

R2: Regularly execute event detection using Spark over the most recent text batch.
Comment: Event detection is part of the ingestion process and adds annotations to the text data; it is not part of the distributed processing.

R3: Improve the speed of the change detection workflow.
Comment: Optimize the scalability of the operators developed in Apache Spark for the change detection workflow.

R4: Extend the change detection workflow to improve accuracy.
Comment: Fundamental SNAP operators (e.g. Subset and Terrain Correction) for Sentinel-1 will be adapted to Apache Spark.

R5: Areas of Interest are automatically defined by event detection.
Comment: The Sentinel data connector is parametrized from the event detection module with a GIS shape.

R6: The end-user interface is based on Sextant.
Comment: Improvement of Sextant functionalities to improve the user experience.

R7: Users must be authenticated and authorized to access the pilot data.
Comment: Sextant will be extended in order to support authentication and authorization.
Figure 7: Architecture of the Second SC7 Pilot
8.3 Architecture

To satisfy the requirements above, the following modules will be deployed:

Storage infrastructures:
- HDFS for storing satellite images
- Cassandra for storing news and tweet content and metadata
- Lucene for storing the GADM dataset, i.e. the administrative areas together with their geo-locations
- Strabon for storing geo-locations of detected changes and location metadata about news and tweets

Processing infrastructures:
- Spark will be made available for improving the change detection module and for developing the event detection module (a sketch of the Spark parallelization is given after this list).

Data integration:
- Semagrow will federate Strabon and Cassandra to provide the user interface with homogeneous access to both data stores.

Other modules:
- Twitter data connector
- Reuters RSS feed reader
- The Sentinel Data Aggregator, which receives as input the set of Areas of Interest and submits a suitable query to the Sentinels Scientific Data Hub
- Sextant as the user interface
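The Spark parallelization could follow the pattern sketched below, which maps a per-pair differencing function over pairs of co-registered tiles; plain image differencing stands in for the pilot's actual SNAP-based change detection operators, and the tile paths are illustrative:

```python
# Sketch: parallelize change detection over pairs of co-registered
# image tiles with Spark. Simple differencing stands in for the
# pilot's SNAP-based operators; tile paths are illustrative and must
# be reachable from every worker.
import numpy as np
from pyspark import SparkContext

def detect_changes(pair):
    """Flag tiles whose mean absolute difference exceeds a threshold."""
    earlier, later = pair
    diff = np.abs(np.load(later) - np.load(earlier))
    return (earlier, later, float(diff.mean()), bool(diff.mean() > 0.1))

sc = SparkContext(appName="change-detection-sketch")
pairs = [("tiles/t1_2016.npy", "tiles/t1_2017.npy"),
         ("tiles/t2_2016.npy", "tiles/t2_2017.npy")]

results = sc.parallelize(pairs).map(detect_changes).collect()
for earlier, later, score, changed in results:
    print(earlier, later, score, "CHANGED" if changed else "no change")
```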
8.4 Deployment

Table 14 lists the components provided to the pilot as part of BDI [16] and components that will be developed within WP6 in the context of executing the pilot.

Table 14: Components needed to deploy the Second SC7 Pilot

Module | Task | Responsible
Big Data Integrator: HDFS/Hadoop, Cassandra, Spark, Semagrow, Strabon, SOLR | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, UoA, SwC
Cassandra and Strabon stores | The schema needs to be altered to support tweets by keyword | NCSR-D and UoA
Change detection module | Spark code to be developed for extending and improving the change detection algorithm | UoA
Event detection module | Spark code to be developed to scale the event detection algorithm | NCSR-D
Twitter data connector | To be extended to access the keyword search Twitter API | NCSR-D
User interface | To be enhanced for the pilot | UoA

[16] Cf. https://github.com/big-data-europe/README/wiki/Components
9 Conclusions

This report analysed the pilot requirements and specified the components of the generic Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting round. The relevant work in this task is to ensure that the components are within the scope of what is prepared in WP4, and that they interoperate and can be used in the same application.

All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure and provided to the piloting partners as a basis for their piloting applications, which will be developed in WP6. As a result of this preliminary testing and of the interaction between the technical partners and the piloting partners, some of the original pilot descriptions have been refined and fully specified, and their usage of BDI components has been clarified. This ensures that the pilot descriptions are consistent with the first public release of the BDI platform (D4.2) and can be reproduced by interested third parties.

Work in this task (Task 5.2) will proceed as follows:

- During the second pilot deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.
- During the third pilot deployment phase, work in this task will prepare the next version of this document, regarding the BDI instances needed for the third piloting round.
5 Second SC4 Pilot Deployment
51 Use cases
The pilot is carried out by FhG and CERTH in the frame of SC4 Smart Green and Integrated
Transport
The pilot demonstrates how to implement the workflow for ingesting processing and storing
stream and historical traffic data in a distributed environment The pilot demonstrates the
following workflows
The map matching of the Floating Car Data (FCD) stream that is generated by the taxi
fleet The FCD data that represents the position of cabs using latitude and longitude
coordinates must be map matched to the roads on which the cabs are driving in order
to infer the traffic conditions of the roads The map matching is done through an
algorithm using a geographical database and topological rules
The monitoring of the current traffic conditions that consumes the mapped FCD data
and infers the traffic conditions of the roads
The forecasting of future traffic conditions based on a model that is trained from
historical and real-time mapped FCD data
The second pilot is based upon the processing modules developed in the first pilot (cf D52
Section 5) namely the processing modules developed by CERTH to analyze traffic data
classify traffic conditions The second pilot will also develop the newly added workflow of the
traffic forecasting and model training that did not exist during the first pilot cycle
The data sources available for the pilot are
A near-real time stream Floating Car Data (FCD) generated by a fleet of 1200 taxis
containing information about the position speed and direction of the cabs
A historical database of recorded FCD data
A geographical database with information about the road network in Thessaloniki
The results of traffic monitoring and traffic forecasting are saved into a database for querying
statistics and visualizations
52 Requirements
Table 7 lists the ingestion storage processing and output requirements set by this pilot Since
the present pilot cycle is an extension of the first pilot the requirements of the first pilot also
apply Table 13 lists only the new requirements
D54 ndash v 100
Page
25
Table 7 Requirements of the Second SC4 Pilot
Requirement Comment
R1 The pilot will enable the
evaluation of the present and
future traffic conditions (eg
congestion) within temporal
windows
The FCD map matched data are used to determine
the current traffic condition and to make predictions
within different time windows
R2 The traffic predictions will be
saved in a database
Traffic condition and prediction will be used for
queries statistics evaluation of the quality of
predictions visualizations
R3 The pilot can be started in two
configurations single node (for
development and testing) and
cluster (production)
It must be possible to run all the pilot components
in one single node for development and testing
purposes The cluster configuration must provide
cluster of any components messaging system
(Kafka) processing modules (Flink Spark
TensorFlow) storage (Postgres)
Table 7 Requirements of the Second SC4 Pilot
D54 ndash v 100
Page
26
Figure 4 Architecture of the Second SC4 Pilot
Figure 4 Architecture of the Second SC4 Pilot
53 Architecture
The architecture of the pilot has been designed taking into consideration the data sources
mostly streams the processing steps needed and the information that needs to be computed
The pilot will ingest data from a near real-time FCD data stream from cabs and from historical
FCD data The FCD data needs to be preprocessed for map matching before being used for
classificationprediction
Apache Kafka will be used to distribute the computations as it provides a scalable fault
tolerant messaging system The processing of the data streams will be performed within
temporal windows Apache Flink will be used for the map matching algorithm in the same
manner as in the first cycle of the pilot Apache Spark or Tensorflow will be considered as a
platform to implement the traffic forecasting algorithm
The algorithms used for the map matching and classification will be provided using R as
it provides a good support for machine learning algorithms and because it is commonly used
D54 ndash v 100
Page
27
and well known by researchers at CERTH In order to use the R packages in a Flink application
developed in Java the pilot will connect to R server (via Rserve) Recurrent Neural Networks
will be used for the traffic forecasting module
The traffic conditions and prediction computation will be stored in a scalable fault tolerant
database such as Elasticsearch The storage system must support spatial and temporal
indexing
54 Deployment
Table 8 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 8 Components needed to deploy Second SC4 Pilot
Module Task Responsible
PostGIS Elasticsearch
Kafka Flink Spark
TensorFlow
BDI dockers made available by WP4 NCSR-D SWC
TF FhG
A Kafka producer for FCD
data stream (source URL)
and historical data (source
file system)
Develop a Kafka producer to collect
the FCD data as a stream from web
services and from the file system for
the historical data sets and send them
to a Kafka topic
FhG
Kafka brokers Install Kafka to provide a message
broker and the topics
SWC
A Spark application for traffic
forecasting and model
training
Develop a Spark application that
consumes FCD matched data from a
Kafka topic The application will train a
prediction model and write the traffic
predictions to ElasticSearch
FhG
D54 ndash v 100
Page
28
A Kafka consumer for storing
analysis results
Develop a Kafka consumer that stores
the result of the Traffic Classification
and prediction module
FhG
Table 8 Components needed to deploy the Second SC4 Pilot
6 Second SC5 Pilot Deployment
61 Use cases
The pilot is carried out by NCSR-D in the frame of SC5 Climate Action Environment Resource
Efficiency and Raw Materials
The pilot demonstrates the following workflow A (potentially hazardous) substance is released
in the atmosphere that results to increased readings in one or more monitoring stations The
user accesses a user interface provided by the pilot to define the locations of the monitoring
stations as well as a timeseries of the measured values (eg gamma dose rate) The platform
initiates
a weather matching algorithm that is a search for similarity of the current weather and
the pre-computed weather patterns as well as
a dispersion matching algorithm that is a search for similarity of the current substance
dispersion patterns with the precomputed ones
The weather patterns have been extracted in a pre-processing step by clustering weather
conditions recorded in the past while the substance dispersion patterns have been
precomputed by simulating different scenarios of substance release and weather conditions
The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request
The following datasets are involved
NetCDF files from the European Centre for Medium range Weather Forecasting
(ECMWF7)
GRIB files from National Oceanic and Atmospheric Administration (NOAA8)
The following processing will be carried out
The weather clustering algorithm that creates clusters of similar weather conditions
implemented using the BDI platform (see Section 63)
7 httpappsecmwfintdatasets 8 httpswwwncdcnoaagovdata-accessmodel-datamodel-datasetsglobal-forcast-system-gfs
D54 ndash v 100
Page
29
The WRF downscaling that takes as input a low resolution weather and creates a high
resolution weather
The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model
computes dispersion patterns given predominant weather conditions
The following outputs are made available for visualization or further processing
The dispersions produced by DIPCOT
The weather clusters produced by the weather clustering algorithm
62 Requirements
Table 9 lists the ingestion storage processing and output requirements set by this pilot
Table 9 Requirements of Second SC5 Pilot
Requirement Comment
R1 Provide a means of downloading
currentevaluation weather from
ECMWF or alternative services
Data connectorinterface needs to be developed
R2 ECMWF and NOAA datasets are
compatible with the WRF and
DIPCOT naming conventions
A preprocessing WPS normalization step will
perform the necessary transformations and
variable renamings needs to ensure compatibility
R3 Retrieve NetCDF files from HDFS
as input to the weather clustering
algorithm
R4 Dispersion matching will filter on
dispersion values
Relational database will provide indexes on
dispersion values for efficient dispersion search
R5 Dispersion visualization Weather and dispersion matching must produce
output compatible with Sextantrsquos input or Sextant
must be modified to support new input
Table 9 Requirements of the Second SC5 Pilot
D54 ndash v 100
Page
30
Figure 5 Architecture of the Second SC5 Pilot
Figure 5 Architecture of the Second SC5 Pilot
63 Architecture
To satisfy the requirements described above the following components will be deployed
Storage infrastructure
HDFS for storing NetCDF and GRIB files
Postgres for storing dispersions
Processing components
Scilearn-kit or TensorFlow to host the weather clustering algorithm
Other modules
ECMWF and NOAA data connectors
WPS normalization procedure
WRF downscaling component
DIPCOT atmospheric dispersion model
Weather and dispersion matching
Sextant for visualizing the dispersion layer
D54 ndash v 100
Page
31
64 Deployment
Table 10 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 10 Components needed to deploy the Second SC5 Pilot
Module Task Responsible
HDFS Sextant Postgres BDI dockers made available by WP4 TF UoA NCSR-D
Scikit-learn TensorFlow To be developed in the pilot NCSR-D
DIPCOT To be packaged in the pilot NCSR-D
Weather clustering algorithm To be developed in the pilot NCSR-D
Weather matching To be developed in the pilot NCSR-D
Dispersion matching To be developed in the pilot NCSR-D
ECMWF and NOAA data
connector
To be developed in the pilot NCSR-D
Data visualization UI To be developed in the pilot NCSR-D
Table 10 Components needed to deploy the Second SC5 Pilot
D54 ndash v 100
Page
32
7 Second SC6 Pilot Deployment
71 Use cases
The pilot is carried out by NCSR-D and SWC in the frame of SC6 Europe in a changing world
- inclusive innovative and reflective societies
The pilot demonstrates the following workflow Municipality economic data (ie budget and
budget execution data) are ingested at a regular basis (daily weekly and so on) from a series
of locations in a variety of structures and formats are homogenized so that they can be
compared analyzed and visualized in a comprehensible way The data is exposed to users
via a dashboard that exposes searchdiscovery aggregation analysis correlation and
visualization functionalities over structured data The results of the data analysis will be stored
in the infrastructure to avoid carrying out the same processing multiple times
The second cycle of the pilot will extend the first pilot by incorporating different formats by
developing a modular parsing library
The following datasets are involved
Budget execution data of Municipality of Athens
Budget execution data of Municipality of Thessaloniki
Budget execution data of Municipality of Barcelona
The current datasets involved are exposed either as an API or as CSV XML files
Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies
Statistical data will be described in the RDF DataCube12 vocabulary
The following processing is carried out
Data ingestion and homogenization
Aggregation analysis correlation over scientific data
The following outputs are made available for visualization or further processing
Structured information extracted from budget datasets exposed as a SPARQL endpoint
Metadata for dataset searching and discovery
9 Cf httpsjoinupeceuropaeuassetdcat_application_profiledescription 10 Cf httpwwwomgorgspecEDMC-FIBOFND10Beta1indexhtm 11 Cf httpwwwomgorghot-topicsfinancehtm 12 Cf httpswwww3orgTR2014REC-vocab-data-cube-20140116
D54 ndash v 100
Page
33
Aggregation and analysis
72 Requirements
Table 11 lists the ingestion storage processing and output requirements set by this pilot
Table 11 Requirements of the Second SC6 Pilot
Requirement Comment
R1 In case of failures during data
processing users should be able to
recover data produced earlier in the
workflow along with their lineage and
associated metadata
Processing modules should periodically
store intermediate results When starting
up processing modules should check at the
metadata registry if intermediate results are
available
R2 Transform budget data into a
homogenized format using various
parsers
Parsers will be developed for the pilot
taking into account R1
R3 Expose data and metadata through a
SPARQL endpoint
The triple store should be accessed via a
SPARQL endpoint
R4 Intuitive easy-to-use interface for
searching and selecting relevant data
sources The use of the user interface
should be documented so that users
can ease into using it with as little
effort as possible
The GraphSearch UI will be used to create
visualizations from SPARQL queries
Table 11 Requirements of the Second SC6 Pilot
D54 ndash v 100
Page
34
Figure 6 Architecture of the Second SC6 Pilot
Figure 6 Architecture of the Second SC6 Pilot
73 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS for storing ingested datasets
4store for storing homogenized statistical data and dataset metadata
Processing infrastructures
Metadata extraction Spark is used to extract RDF data and metadata from budget
data These tools will react on Kafka messages
PoolParty A SKOS Thesaurus13 will be used to consolidatetranslate (linkmap) the
terms in the ingested documents (eg bio terms locations and other named entities)
For this step the SWC PoolParty Semantic Suite14 will be used as an external service
13 Please cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 14 Please cf httpwwwpoolpartybiz
D54 ndash v 100
Page
35
PoolParty is accessible from the BDE components via an HTTP API The connection
between Spark and PoolParty has been implemented in the first pilot cycle Additional
enrichment of the dataset will be explored eg via linking to DBpedia or other LOD
sources
Data analysis that will be performed on demand by pre-defined queries in the
dashboard
Other modules
Flume for dataset ingestion For every source that will be ingested into the system there
will be a flume agent responsible for data ingestion and basic modificationunification
Kafka as soon as a new record is available a Kafka message will be produced One
kafka consumer stores raw data into HDFS
A set of pre-defined SPARQL queries that carry out analytical aggregations important
comparisons and or other analysis of the data
GUI that provide functionality for (a) metadata searching to discover datasets data and
publications (b) linked data browsing (ie dereferencing entity descriptions in RDF) in
the form of a visual dashboard realised in d3js15
GraphSearch as the user interface
74 Deployment
Table 12 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 12 Components needed to deploy the Second SC6 Pilot
Module Task Responsible
Spark over HDFS 4store
Flume Kafka
BDI dockers made available by WP4 FH TF InfAI
NCSR-D SWC
Data storage schema To be extended for the pilot SWC
Metadata extraction Parsers for different data sources will
be developed for the pilot
SWC
15 Cf httpsd3jsorg
D54 ndash v 100
Page
36
GraphSearch GUI To be configured for the pilot SWC
Table 12 Components needed to deploy the Second SC6 Pilot
8 Second SC7 Pilot Deployment
81 Use cases
The pilot is carried out by SatCen UoA and NCSR-D in the frame of SC7 Secure societies ndash
Protecting freedom and security of Europe and its citizens
The pilot demonstrates the following workflows
1 Event detection workflow News sites and social media are monitored and processed
in order to extract and localize information about events Events are categorized and
the information from them is extracted the end-user is notified about the area interested
by the news and can visualize the events information together with the changes
detected by the other workflow (if activated)
2 Change detection workflow The end user selects a relevant Area of Interest With
respect to the selected dates two satellite images (earliest and latest) of these areas
are downloaded from ESA Sentinels Scientific Data Hub and processed in order to
detect changes The end-user is notified about detected changes and can view the
images and event information about this area
The second cycle of the SC7 pilot will extend the functionality and improve the performance of
the first cycle of the pilot (cf D52 Section 8)
Apart from the datasets used in the first cycle of the pilot this cycle will also use the keyword-
based Twitter API to retrieve tweets based on pre-defined keywords To further support the
keyword-based search the second cycle of the pilot will also include a full-text indexing engine
The following outputs are made available for visualization or further processing
Relevant news related to specific keywords together with the corresponding Area of
Interested
Detected changes
Moreover the event detection workflow will be extended in order to automatically activate the
change detection workflow These changes are depicted in the updated architecture diagram
in Figure 7
D54 ndash v 100
Page
37
82 Requirements
Table 13 lists the ingestion storage processing and output requirements set by the second
cycle of the pilot Since the present pilot cycle is an extension of the first pilot the requirements
of the first pilot also apply Table 13 lists only the new requirements
Table 13 Requirements of the Second SC7 Pilot
Requirement Comment
R1 Monitor keyword-based text services
(Twitter) Text is retrieved and stored
together with provenance and any
metadata provided by the service
(notably location)
The NOMAD data connectors to Twitter
and Reuters will be adapted to access the
keyword search API of Twitter and store to
Cassandra
R2 Regularly execute event detection
using Spark over the most recent text
batch
Event detection is part of the ingestion
process and adds annotations to the text
data not part of the distributed processing
R3 Improve the speed of the change
detection workflow
Optimize the scalability of the operators
developed in Apache Spark for the change
detection workflow
R4 Extend change detection workflow to
improve accuracy
Fundamental SNAP operators (eg Subset
and Terrain Correction) for Sentinel 1 will be
adapted to Apache Spark
R5 Areas of Interest are automatically
defined by event detection
The Sentinel data connector is
parametrized from the event detection
module with a GIS shape
R6 End-user interface is based on Sextant Improvement of Sextant functionalities to
improve the user experience
D54 ndash v 100
Page
38
R7 Users must be authenticated and
authorized to access the pilot data
Sextant will be extended in order to support
authentication and authorization
Table 13 Requirements of the Second SC7 Pilot
Figure 7 Architecture of the Second SC7 Pilot
Figure 7 Architecture of the Second SC7 Pilot
83 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS for storing satellite images
Cassandra for storing news and tweets content and metadata
Lucene for storing GADM dataset ie the administrative areas together with their geo-
locations
D54 ndash v 100
Page
39
Strabon for storing geo-locations of detected changes and location metadata about
news and tweets
Processing infrastructures
Spark will be made available for improving the change detection module and
developing the event detection module
Data integration
Semagrow will federate Strabon and Cassandra to provide the user interface with
homogeneous access to both data stores
Other modules
Twitter data connector
Reuters RSS feed reader
The Sentinel Data Aggregator receives as input the set of areas of interest and submits
a suitable query to the Sentinels Scientific Data Hub
Sextant as the user interface
84 Deployment
Table 14 lists the components provided to the pilot as part of BDI16 and components that will
be developed within WP6 in the context of executing the pilot
Table 14 Components needed to deploy the Second SC7 Pilot
Module Task Responsible
Big Data Integrator
HDFSHadoop Cassandra
Spark Semagrow Strabon
SOLR
BDI dockers made available by WP4 FH TF InfAI
NCSR-D UoA
SwC
Cassandra and Strabon
stores
The schema needs to be altered to
support tweets by keyword
NCSR-D and
UoA
Change detection module Spark code to be developed for UoA
16 Cf httpsgithubcombig-data-europeREADMEwikiComponents
D54 ndash v 100
Page
40
extending and improving the change
detection algorithm
Event Detection module Spark code to be developed to scale
the event detection algorithm
NCSR-D
Twitter data connector To be extended to access the keyword
search Twitter API
NCSR-D
User interface To be enhanced for the pilot UoA
Table 14 Components needed to deploy the Second SC7 Pilot
D54 ndash v 100
Page
41
9 Conclusions This report analysed the pilot requirements and specifies the components of the the generic
Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting
round The relevant work in this task is to ensure that the components are within the scope
of what is prepared in WP4 and that they interoperate and can be used in the same
application
All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure
and provided to the piloting partners as a basis for their piloting applications which will be
developed in WP6 As a result of this preliminary testing and the interaction between the
technical partners and the piloting partners some of the original pilot descriptions have
been refined and fully specified and their usage of BDI components has been clarified This
ensures that the pilot descriptions are consistent with the first public release of the BDI
platform (D42) and can be reproduced by interested third parties
Work in this task (Task 52) will proceed as follows
During the second pilot deployment phase work in this task will follow and document
development of the individual components and test their integration into the platform
During the third pilot deployment phase work in this task will prepare the next version
of this document regarding the BDI instances needed for the third piloting round
D54 ndash v 100
Page
14
3 Second SC2 Pilot Deployment
31 Overview
The pilot is carried out by AK FAO and SWC in the frame of SC2 Food Security Sustainable
Agriculture and Forestry Marine Maritime and Inland Water Research and the Bioeconomy
The second pilot cycle builds upon the first pilot cycle (cf D51 Section 3) expanding the
relevant data sources and extending the data processing needed to handle a variety of data
types (apart from bibliographic data) relevant to Viticulture
The pilot demonstrates the following workflows:
1. Text mining workflow: Automatically annotating scientific publications by (a) extracting named entities (locations, domain terms) and (b) extracting the captions of images, figures, and tables. The extracted information is provided to viticultural researchers via a GUI that exposes search functionality.
2. Data processing workflow: The end users (viticultural researchers) upload scientific data in a variety of formats and provide the metadata needed in order to correctly interpret the data. The data is ingested and homogenized so that it can be compared and connected with other relevant data, originally in diverse formats. The data is exposed to viticultural researchers via a GUI that exposes search/discovery, aggregation, analysis, correlation, and visualization functionalities over structured data. The results of the data analysis will be stored in the infrastructure, with appropriate provenance, to avoid carrying out the same processing multiple times and for future reference, publication, and scientific replication.
3. Phenological modelling workflow, that is, the scheduling of agricultural operations (e.g., pruning, harvesting, etc.) by cross-examining the weather data observed in the area of the vineyard with the appropriate weather conditions needed for the aforementioned operations.
4. Variety identification workflow: The end users complete an on-spot questionnaire regarding the characteristics of a specific grape variety. Together with the geolocation of the questionnaire, this information is used to identify the grape variety.
The following datasets will be involved:
- The AGRIS and PubMed datasets that include scientific publications
- Weather data available via publicly-available APIs, such as AccuWeather, OpenWeatherMap, and Weather Underground
- User-generated data, such as geotagged photos of leaves, young shoots, and grape clusters, ampelographic data, and SSR-marker data that will be provided by the VITIS application
- The OIV Descriptor List2 for Grape Varieties and Vitis species
- The Crop Ontology
The following processing is carried out:
- Named entity extraction
- Researcher affiliation extraction and verification
- Variety identification
- Phenological modelling
- PDF structure processing, to associate tables and diagrams with captions
- Data ingestion and homogenization
- Aggregation, analysis, and correlation over scientific data
The following outputs are made available for visualization or further processing:
- Structured information and topics extracted from scientific publications
- Metadata for dataset searching and discovery
- Aggregation, analysis, and correlation results
3.2 Requirements
Table 3 lists the ingestion, storage, processing, and output requirements set by this pilot.
Table 3: Requirements of the Second SC2 Pilot

| Requirement | Comment |
| --- | --- |
| R1: In case of failures during data processing, users should be able to recover data produced earlier in the workflow, along with their lineage and associated metadata | Processing modules should periodically store intermediate results and their lineage metadata. When starting up, processing modules should check at the metadata registry whether intermediate results are available |
| R2: Extracting images and their captions from scientific publications | To be developed for the pilot, taking into account R1 |
| R3: Extracting thematic annotations from text in scientific publications | To be developed for the pilot, taking into account R1 |
| R4: Extracting researcher affiliations from the scientific publications | To be developed for the pilot, taking into account R1 |
| R5: Variety identification | To be developed for the pilot, taking into account R1 |
| R6: Phenological modelling | To be developed for the pilot, taking into account R1 |
| R7: Expose data and metadata in JSON through a Web API | The data ingestion module should write JSON documents in HDFS; 4store should be accessed via a SPARQL endpoint that responds with results in JSON |

2 Cf. http://www.oiv.int/en
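Requirement R1 amounts to a checkpoint-and-resume pattern. The following Python fragment is a minimal sketch of how a processing module could implement it, assuming intermediate results and their lineage metadata are stored as JSON documents on HDFS via WebHDFS; the URL, paths, and field names are illustrative assumptions, not part of the pilot specification.

```python
import json
from datetime import datetime, timezone
from hdfs import InsecureClient  # hdfscli; WebHDFS must be enabled on the namenode

# All paths and field names below are illustrative.
HDFS_URL = "http://namenode:50070"
RESULTS = "/sc2/intermediate/annotations.json"
LINEAGE = "/sc2/intermediate/annotations.lineage.json"

client = InsecureClient(HDFS_URL, user="bde")

def checkpoint(results, step, inputs):
    """Store an intermediate result together with its lineage metadata."""
    client.write(RESULTS, json.dumps(results), overwrite=True, encoding="utf-8")
    lineage = {"step": step, "inputs": inputs,
               "written": datetime.now(timezone.utc).isoformat()}
    client.write(LINEAGE, json.dumps(lineage), overwrite=True, encoding="utf-8")

def resume_or_none():
    """On start-up, check the registry and recover earlier results if present."""
    if client.status(LINEAGE, strict=False) is None:
        return None  # no checkpoint recorded: start from scratch
    with client.read(RESULTS, encoding="utf-8") as reader:
        return json.load(reader)
```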
Figure 2: Architecture of the Second SC2 Pilot
3.3 Architecture
To satisfy the requirements above, the following modules will be deployed.
Storage infrastructures:
- HDFS for storing publication full-text and ingested datasets
- A graph database for storing publication metadata (terms and named entities), affiliation metadata (connections between researchers), weather metadata, and VITIS metadata
Processing infrastructures:
- Metadata extraction: Spark or UnifiedViews3 will be used to extract RDF metadata from publication full-text. These tools will react to Kafka messages; Spark and UnifiedViews will be evaluated for this task.

3 Cf. http://www.unifiedviews.eu
- PoolParty: a SKOS Thesaurus4 will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g., bio terms, locations, and other named entities). For this step the SWC PoolParty Semantic Suite5 will be used. Additional enrichment of the dataset will be explored, e.g., via linking to DBpedia or other LOD sources.
- AKSTEM: the process of discovering relations and associations between organizations and people in the field of viticulture research.
- Phenological modelling: an algorithm already developed in AK VITIS, to be adapted to work in the context of an Apache Spark application.
- Variety identification: already developed in AK VITIS, to be adapted to work in the context of an Apache Spark application.
- Extraction of images and figures and their captions from publication PDFs.
- Data analysis, which writes analysis results back into the infrastructure to be retrieved for visualization. Data analysis should accompany each write-back with appropriate metadata that specifies the processing lineage of the derived dataset. Intermediate results should also be written out (and described as such in the metadata) in order to allow resuming processing after a failure.
Other modules:
- Flume for publication ingestion: for every source that will be ingested into the system, there will be a Flume agent responsible for data ingestion and basic modification/unification.
- Kafka: as soon as a new record is available, a Kafka message will be produced. One Kafka consumer stores raw data into HDFS; a minimal sketch of such a consumer is given below.
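The following sketch shows the raw-data path of this design in Python, assuming the kafka-python and hdfscli libraries; the topic name, broker address, and HDFS layout are illustrative placeholders rather than the pilot's actual configuration.

```python
import json
from kafka import KafkaConsumer   # kafka-python
from hdfs import InsecureClient   # hdfscli (WebHDFS)

# Topic, broker, and path names are illustrative placeholders.
consumer = KafkaConsumer("publications",
                         bootstrap_servers="kafka:9092",
                         value_deserializer=lambda v: json.loads(v.decode("utf-8")))
hdfs = InsecureClient("http://namenode:50070", user="bde")

for message in consumer:
    record = message.value
    # One raw JSON document per record, keyed by Kafka offset for traceability.
    path = "/sc2/raw/{}/{}.json".format(message.topic, message.offset)
    hdfs.write(path, json.dumps(record), encoding="utf-8")
```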
3.4 Deployment
Table 4 lists the components provided to the pilot as part of BDI, and components that will be developed within WP6 in the context of executing the pilot.
Table 4: Components needed to deploy the Second SC2 Pilot

| Module | Task | Responsible |
| --- | --- | --- |
| Spark over HDFS, Flume, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, SWC |
| GraphDB and/or Neo4j dockerization | It will be investigated whether the Docker images provided by the official systems6 are suitable for the pilot. If not, they will be adapted for the pilot, or an already dockerized triple store such as Virtuoso or 4store will be used | SWC |
| Flume agents for publication ingestion and processing | To be developed for the pilot | SWC |
| Flume agents for data ingestion | To be extended for the pilot in order to support the introduced datasets (AccuWeather data, user-generated data) | SWC, AK |
| Data storage schema | To be developed for the pilot | SWC, AK |
| Phenological modelling | To be adapted from AK VITIS for the pilot | AK |
| Spark AKSTEM | To be adapted from AK STEM for the pilot | AK |
| Variety identification | To be adapted from AK VITIS for the pilot | AK |

4 Cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System
5 Cf. http://www.poolparty.biz
6 https://neo4j.com/developer/docker
4 Second SC3 Pilot Deployment
4.1 Overview
The pilot is carried out by CRES in the frame of SC3: Secure, Clean and Efficient Energy. The second pilot cycle extends the first pilot by adding additional online and offline data analysis on raw data from Acoustic Emissions (AE) sensors and on aggregated data, such as parametrics from continuous monitoring systems (CMS). The pilot demonstrates the following workflow: a developer in the field of wind energy enhances condition monitoring for each unit in a wind farm by pooling together data from multiple units of the same farm (to consider the cluster operation in total) and third-party data (to perform correlated assessment). The custom analysis modules created by the developer use both raw data that is transferred offline to the processing cluster and condensed data streamed online, in the time order in which the events occur.
The following datasets are involved:
- Raw sensor and SCADA data from a given wind farm
- Online stream data comprised of parametrics and statistics extracted from the raw SCADA data
- Raw sensor data from the Acoustic Emissions module of a given wind farm
All data is in custom binary or ASCII formats. ASCII files contain a metadata header and, in tabulated form, the signal data (signals in columns, time sequence in rows). All data is annotated by location, time, and system id; a possible parsing scheme is sketched below.
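Since the deliverable does not fix the exact file layout, the following Python sketch assumes a simple "key: value" header convention followed by whitespace-separated numeric columns; the header syntax and field names are assumptions for illustration only.

```python
import numpy as np

def parse_ascii_slice(path):
    """Parse one ASCII data file: a metadata header followed by tabulated
    signal data (signals in columns, time sequence in rows).

    The 'key: value' header convention is an assumption for illustration;
    the actual CRES file layout is not specified by this document."""
    metadata, data_start = {}, 0
    with open(path) as f:
        lines = f.readlines()
    for i, line in enumerate(lines):
        if ":" in line:                      # header line, e.g. "location: Lavrio"
            key, _, value = line.partition(":")
            metadata[key.strip()] = value.strip()
        else:
            data_start = i
            break
    # Rows are time steps, columns are individual signals.
    signals = np.loadtxt(lines[data_start:])
    return metadata, signals
```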
The following processing is carried out:
- Near-real-time execution of parametrized models to return operational statistics and warnings, including correlation analysis of data across units
- Weekly execution of operational statistics
- Weekly execution of model parametrization
- Weekly execution of specific acoustic emissions DSP
The following outputs are made available for visualization or further processing:
- Operational statistics, near-real-time and weekly
- Model parameters
4.2 Requirements
Table 5 lists the ingestion, storage, processing, and output requirements set by this pilot. Since the second cycle of the pilot extends the first pilot, some requirements are identical and therefore omitted from Table 5.
Table 5: Requirements of the Second SC3 Pilot

| Requirement | Comment |
| --- | --- |
| R1: The online data will be sent (via OPC) from the intermediate (local) processing level to BDI | A data connector must be developed that provides for receiving OPC streams from an OPC-compatible server |
| R2: The application should be able to recover from short outages by collecting the data transmitted during the outage from the data sources | An OPC data connector must be developed that can retrieve the missing data, collected at the intermediate level, from the distributed data historian systems |
| R3: Near-real-time execution of parametrized models to return operational statistics, including correlation analysis of data across units | The analysis software should write its results back in a specified format and data model that is appropriate input for further analysis |
| R4: The GUI supports database querying and data visualization for the analytics results | The GUI will be able to access files in the specified format and data model |
Figure 3: Architecture of the Second SC3 Pilot
4.3 Architecture
To satisfy the requirements above, the following modules will be deployed.
Storage infrastructures:
- HDFS, which stores binary blobs, each holding a temporal slice of the complete data. The slicing parameters are fixed and can be applied at data ingestion time.
- A Postgres relational database to store the warnings, operational statistics, and the output of the analysis. The schema will be defined at a later stage.
- A Kafka broker that will distribute the continuous stream of CMS data to model execution.
Processing infrastructures:
- A processor that operates upon temporal slices of data
- A Spark module that orchestrates the application of the processor on slices
- A Spark streaming module that operates on the online data
Other modules:
- A data connector that offers an ingestion endpoint and/or can retrieve from remote data sources using the FTP protocol
- A data connector that offers an ingestion endpoint, can retrieve an online stream using the OPC protocol, and publishes it to a Kafka topic (sketched below)
- Data visualization that can visualize the data files stored in HDFS
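The OPC-to-Kafka connector could look roughly like the following Python sketch. The deliverable does not fix the OPC flavour or the client library, so the use of python-opcua is an assumption, and the endpoint, node id, and topic name are placeholders.

```python
import json
import time
from opcua import Client          # python-opcua; library choice is an assumption
from kafka import KafkaProducer   # kafka-python

# Endpoint, node id, and topic are placeholders for the pilot's configuration.
opc = Client("opc.tcp://cms-server:4840")
producer = KafkaProducer(bootstrap_servers="kafka:9092",
                         value_serializer=lambda v: json.dumps(v).encode("utf-8"))

opc.connect()
try:
    node = opc.get_node("ns=2;s=WindFarm.Unit1.CMS.Parametrics")
    while True:
        reading = {"system_id": "unit-1", "ts": time.time(),
                   "value": node.get_value()}
        producer.send("cms-stream", reading)  # consumed by the Spark streaming module
        time.sleep(1.0)
finally:
    opc.disconnect()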
4.4 Deployment
Table 6 lists the components provided to the pilot as part of BDI, and components that will be developed within WP6 in the context of executing the pilot.
Table 6: Components needed to deploy the Second SC3 Pilot

| Module | Task | Responsible |
| --- | --- | --- |
| Spark, HDFS, Postgres, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, SWC |
| Acoustic Emissions DSP | To be developed for the pilot | CRES |
| OPC data connector | To be developed for the pilot | CRES |
| Data visualization | To be extended for the pilot | CRES |
5 Second SC4 Pilot Deployment
5.1 Use cases
The pilot is carried out by FhG and CERTH in the frame of SC4: Smart, Green and Integrated Transport.
The pilot demonstrates how to implement the workflow for ingesting, processing, and storing stream and historical traffic data in a distributed environment. The pilot demonstrates the following workflows:
- The map matching of the Floating Car Data (FCD) stream that is generated by the taxi fleet. The FCD data, which represents the position of cabs using latitude and longitude coordinates, must be map matched to the roads on which the cabs are driving, in order to infer the traffic conditions of the roads. The map matching is done through an algorithm using a geographical database and topological rules.
- The monitoring of the current traffic conditions, which consumes the map-matched FCD data and infers the traffic conditions of the roads.
- The forecasting of future traffic conditions, based on a model that is trained from historical and real-time map-matched FCD data.
The second pilot is based upon the processing modules developed in the first pilot (cf. D5.2, Section 5), namely the processing modules developed by CERTH to analyze traffic data and classify traffic conditions. The second pilot will also develop the newly added workflow of traffic forecasting and model training that did not exist during the first pilot cycle.
The data sources available for the pilot are:
- A near-real-time stream of Floating Car Data (FCD), generated by a fleet of 1200 taxis, containing information about the position, speed, and direction of the cabs
- A historical database of recorded FCD data
- A geographical database with information about the road network in Thessaloniki
The results of traffic monitoring and traffic forecasting are saved into a database for querying, statistics, and visualizations.
5.2 Requirements
Table 7 lists the ingestion, storage, processing, and output requirements set by this pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 7 lists only the new requirements.
Table 7: Requirements of the Second SC4 Pilot

| Requirement | Comment |
| --- | --- |
| R1: The pilot will enable the evaluation of the present and future traffic conditions (e.g., congestion) within temporal windows | The map-matched FCD data are used to determine the current traffic condition and to make predictions within different time windows |
| R2: The traffic predictions will be saved in a database | Traffic conditions and predictions will be used for queries, statistics, evaluation of the quality of predictions, and visualizations |
| R3: The pilot can be started in two configurations: single node (for development and testing) and cluster (production) | It must be possible to run all the pilot components on one single node for development and testing purposes. The cluster configuration must provide clustering of all components: messaging system (Kafka), processing modules (Flink, Spark, TensorFlow), and storage (Postgres) |
Figure 4: Architecture of the Second SC4 Pilot
5.3 Architecture
The architecture of the pilot has been designed taking into consideration the data sources (mostly streams), the processing steps needed, and the information that needs to be computed. The pilot will ingest data from a near-real-time FCD data stream from cabs and from historical FCD data. The FCD data needs to be preprocessed for map matching before being used for classification/prediction.
Apache Kafka will be used to distribute the computations, as it provides a scalable, fault-tolerant messaging system. The processing of the data streams will be performed within temporal windows. Apache Flink will be used for the map matching algorithm, in the same manner as in the first cycle of the pilot. Apache Spark or TensorFlow will be considered as a platform to implement the traffic forecasting algorithm.
The algorithms used for the map matching and classification will be implemented in R, as it provides good support for machine learning algorithms and because it is commonly used
and well known by researchers at CERTH. In order to use the R packages in a Flink application developed in Java, the pilot will connect to an R server (via Rserve). Recurrent Neural Networks will be used for the traffic forecasting module.
The traffic conditions and prediction computations will be stored in a scalable, fault-tolerant database, such as Elasticsearch. The storage system must support spatial and temporal indexing. A sketch of the Kafka producer that feeds this pipeline is given below.
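The ingestion side of this architecture (the Kafka producer for FCD listed in Table 8) could be sketched in Python as follows; the record fields are inferred from the overview (position, speed, direction), and the topic name, broker address, and file layout are assumptions.

```python
import csv
import json
from kafka import KafkaProducer  # kafka-python

# Field names, topic, and file location are illustrative assumptions.
producer = KafkaProducer(bootstrap_servers="kafka:9092",
                         value_serializer=lambda v: json.dumps(v).encode("utf-8"))

def replay_historical(csv_path, topic="fcd-raw"):
    """Send recorded FCD rows (one cab position per row) to the Kafka topic
    that feeds the Flink map-matching job."""
    with open(csv_path) as f:
        for row in csv.DictReader(f):
            record = {"taxi_id": row["taxi_id"],
                      "lat": float(row["lat"]), "lon": float(row["lon"]),
                      "speed": float(row["speed"]),
                      "direction": float(row["direction"]),
                      "timestamp": row["timestamp"]}
            producer.send(topic, record)
    producer.flush()

replay_historical("fcd_history.csv")
```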
5.4 Deployment
Table 8 lists the components provided to the pilot as part of BDI, and components that will be developed within WP6 in the context of executing the pilot.
Table 8: Components needed to deploy the Second SC4 Pilot

| Module | Task | Responsible |
| --- | --- | --- |
| PostGIS, Elasticsearch, Kafka, Flink, Spark, TensorFlow | BDI dockers made available by WP4 | NCSR-D, SWC, TF, FhG |
| A Kafka producer for the FCD data stream (source: URL) and historical data (source: file system) | Develop a Kafka producer to collect the FCD data as a stream from web services and from the file system for the historical data sets, and send them to a Kafka topic | FhG |
| Kafka brokers | Install Kafka to provide a message broker and the topics | SWC |
| A Spark application for traffic forecasting and model training | Develop a Spark application that consumes FCD map-matched data from a Kafka topic. The application will train a prediction model and write the traffic predictions to Elasticsearch | FhG |
| A Kafka consumer for storing analysis results | Develop a Kafka consumer that stores the results of the traffic classification and prediction module | FhG |
6 Second SC5 Pilot Deployment
6.1 Use cases
The pilot is carried out by NCSR-D in the frame of SC5: Climate Action, Environment, Resource Efficiency and Raw Materials.
The pilot demonstrates the following workflow: a (potentially hazardous) substance is released in the atmosphere, resulting in increased readings at one or more monitoring stations. The user accesses a user interface provided by the pilot to define the locations of the monitoring stations, as well as a timeseries of the measured values (e.g., gamma dose rate). The platform initiates:
- a weather matching algorithm, that is, a search for similarity between the current weather and the pre-computed weather patterns, as well as
- a dispersion matching algorithm, that is, a search for similarity between the current substance dispersion and the precomputed dispersion patterns.
The weather patterns have been extracted in a pre-processing step by clustering weather conditions recorded in the past, while the substance dispersion patterns have been precomputed by simulating different scenarios of substance release and weather conditions. The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request; a sketch of the matching step is given below.
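The deliverable does not define the similarity measure, so the following Python sketch assumes matching by Euclidean distance to pre-computed cluster centroids; the feature layout and the distance metric are illustrative assumptions.

```python
import numpy as np

def match_weather(current, centroids):
    """Return the index of the pre-computed weather pattern closest to the
    current conditions. Both arguments are flattened feature vectors (e.g.
    wind components and temperature over the model grid).

    Euclidean distance is an illustrative choice; the pilot's actual
    similarity measure is defined by the weather matching algorithm."""
    current = np.asarray(current, dtype=float)
    distances = np.linalg.norm(centroids - current, axis=1)
    return int(np.argmin(distances))

# Example: three pre-computed patterns over a toy 2-feature grid.
patterns = np.array([[1.0, 0.0], [0.2, 0.9], [-0.7, 0.4]])
print(match_weather([0.1, 1.0], patterns))  # -> 1 (the second pattern)
```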
The following datasets are involved:
- NetCDF files from the European Centre for Medium-Range Weather Forecasts (ECMWF7)
- GRIB files from the National Oceanic and Atmospheric Administration (NOAA8)
The following processing will be carried out:
- The weather clustering algorithm, which creates clusters of similar weather conditions, implemented using the BDI platform (see Section 6.3)
- The WRF downscaling, which takes as input low-resolution weather data and creates high-resolution weather data
- The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model, which computes dispersion patterns given predominant weather conditions
The following outputs are made available for visualization or further processing:
- The dispersions produced by DIPCOT
- The weather clusters produced by the weather clustering algorithm

7 http://apps.ecmwf.int/datasets
8 https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/global-forcast-system-gfs
6.2 Requirements
Table 9 lists the ingestion, storage, processing, and output requirements set by this pilot.
Table 9: Requirements of the Second SC5 Pilot

| Requirement | Comment |
| --- | --- |
| R1: Provide a means of downloading current/evaluation weather from ECMWF or alternative services | A data connector/interface needs to be developed |
| R2: ECMWF and NOAA datasets are compatible with the WRF and DIPCOT naming conventions | A preprocessing WPS normalization step will perform the necessary transformations and variable renamings needed to ensure compatibility |
| R3: Retrieve NetCDF files from HDFS as input to the weather clustering algorithm | |
| R4: Dispersion matching will filter on dispersion values | A relational database will provide indexes on dispersion values for efficient dispersion search |
| R5: Dispersion visualization | Weather and dispersion matching must produce output compatible with Sextant's input, or Sextant must be modified to support the new input |
Figure 5: Architecture of the Second SC5 Pilot
6.3 Architecture
To satisfy the requirements described above, the following components will be deployed.
Storage infrastructure:
- HDFS for storing NetCDF and GRIB files
- Postgres for storing dispersions
Processing components:
- Scikit-learn or TensorFlow to host the weather clustering algorithm (a hedged sketch is given below)
Other modules:
- ECMWF and NOAA data connectors
- WPS normalization procedure
- WRF downscaling component
- DIPCOT atmospheric dispersion model
- Weather and dispersion matching
- Sextant for visualizing the dispersion layer
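The clustering step could be hosted on scikit-learn roughly as follows. The NetCDF variable names (u10, v10), the number of clusters, and the file names are assumptions depending on the ECMWF download configuration, not values fixed by the pilot.

```python
import numpy as np
from netCDF4 import Dataset
from sklearn.cluster import KMeans

# Variable names and cluster count are illustrative assumptions.
def load_features(nc_path):
    ds = Dataset(nc_path)
    u = np.asarray(ds.variables["u10"][:])   # shape: (time, lat, lon)
    v = np.asarray(ds.variables["v10"][:])
    ds.close()
    # One flattened feature vector per time step.
    return np.concatenate([u.reshape(len(u), -1), v.reshape(len(v), -1)], axis=1)

features = load_features("ecmwf_sample.nc")
kmeans = KMeans(n_clusters=8, random_state=0).fit(features)
# Cluster centres are the pre-computed weather patterns used for matching;
# labels assign each historical time step to one pattern.
np.save("weather_patterns.npy", kmeans.cluster_centers_)
```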
6.4 Deployment
Table 10 lists the components provided to the pilot as part of BDI, and components that will be developed within WP6 in the context of executing the pilot.
Table 10: Components needed to deploy the Second SC5 Pilot

| Module | Task | Responsible |
| --- | --- | --- |
| HDFS, Sextant, Postgres | BDI dockers made available by WP4 | TF, UoA, NCSR-D |
| Scikit-learn, TensorFlow | To be developed in the pilot | NCSR-D |
| DIPCOT | To be packaged in the pilot | NCSR-D |
| Weather clustering algorithm | To be developed in the pilot | NCSR-D |
| Weather matching | To be developed in the pilot | NCSR-D |
| Dispersion matching | To be developed in the pilot | NCSR-D |
| ECMWF and NOAA data connector | To be developed in the pilot | NCSR-D |
| Data visualization UI | To be developed in the pilot | NCSR-D |
7 Second SC6 Pilot Deployment
7.1 Use cases
The pilot is carried out by NCSR-D and SWC in the frame of SC6: Europe in a changing world - inclusive, innovative and reflective societies.
The pilot demonstrates the following workflow: municipality economic data (i.e., budget and budget execution data) are ingested on a regular basis (daily, weekly, and so on) from a series of locations, in a variety of structures and formats, and are homogenized so that they can be compared, analyzed, and visualized in a comprehensible way. The data is exposed to users via a dashboard that provides search/discovery, aggregation, analysis, correlation, and visualization functionalities over structured data. The results of the data analysis will be stored in the infrastructure to avoid carrying out the same processing multiple times.
The second cycle of the pilot will extend the first pilot by incorporating different formats, through the development of a modular parsing library.
The following datasets are involved:
- Budget execution data of the Municipality of Athens
- Budget execution data of the Municipality of Thessaloniki
- Budget execution data of the Municipality of Barcelona
The current datasets involved are exposed either as an API or as CSV/XML files. Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies. Statistical data will be described in the RDF DataCube12 vocabulary; an illustrative sketch follows.
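To make the data model concrete, the following Python sketch builds one budget-execution figure as an RDF Data Cube observation with rdflib. The dimension and measure properties (and the example.org namespace) are hypothetical placeholders, not the pilot's actual vocabulary.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

QB = Namespace("http://purl.org/linked-data/cube#")
EX = Namespace("http://example.org/budget/")  # placeholder namespace

g = Graph()
g.bind("qb", QB)

# One budget-execution figure as a Data Cube observation; the dimension and
# measure properties are illustrative, not the pilot's actual data model.
obs = URIRef(EX["obs/athens-2016-Q1-roads"])
g.add((obs, RDF.type, QB.Observation))
g.add((obs, EX.municipality, EX["Athens"]))
g.add((obs, EX.period, Literal("2016-Q1")))
g.add((obs, EX.budgetLine, EX["road-maintenance"]))
g.add((obs, EX.executedAmount, Literal("125000.00", datatype=XSD.decimal)))

print(g.serialize(format="turtle"))
```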
The following processing is carried out:
- Data ingestion and homogenization
- Aggregation, analysis, and correlation over the budget data
The following outputs are made available for visualization or further processing:
- Structured information extracted from the budget datasets, exposed as a SPARQL endpoint
- Metadata for dataset searching and discovery

9 Cf. https://joinup.ec.europa.eu/asset/dcat_application_profile/description
10 Cf. http://www.omg.org/spec/EDMC-FIBO/FND/1.0/Beta1/index.htm
11 Cf. http://www.omg.org/hot-topics/finance.htm
12 Cf. https://www.w3.org/TR/2014/REC-vocab-data-cube-20140116/
- Aggregation and analysis results
7.2 Requirements
Table 11 lists the ingestion, storage, processing, and output requirements set by this pilot.
Table 11: Requirements of the Second SC6 Pilot

| Requirement | Comment |
| --- | --- |
| R1: In case of failures during data processing, users should be able to recover data produced earlier in the workflow, along with their lineage and associated metadata | Processing modules should periodically store intermediate results. When starting up, processing modules should check at the metadata registry whether intermediate results are available |
| R2: Transform budget data into a homogenized format using various parsers | Parsers will be developed for the pilot, taking into account R1 |
| R3: Expose data and metadata through a SPARQL endpoint | The triple store should be accessed via a SPARQL endpoint |
| R4: Intuitive, easy-to-use interface for searching and selecting relevant data sources. The use of the user interface should be documented, so that users can ease into using it with as little effort as possible | The GraphSearch UI will be used to create visualizations from SPARQL queries |
Figure 6: Architecture of the Second SC6 Pilot
7.3 Architecture
To satisfy the requirements above, the following modules will be deployed.
Storage infrastructures:
- HDFS for storing ingested datasets
- 4store for storing homogenized statistical data and dataset metadata
Processing infrastructures:
- Metadata extraction: Spark is used to extract RDF data and metadata from the budget data. These tools will react to Kafka messages.
- PoolParty: a SKOS Thesaurus13 will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g., bio terms, locations, and other named entities). For this step the SWC PoolParty Semantic Suite14 will be used as an external service.

13 Cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System
14 Cf. http://www.poolparty.biz
PoolParty is accessible from the BDE components via an HTTP API; the connection between Spark and PoolParty has been implemented in the first pilot cycle. Additional enrichment of the dataset will be explored, e.g., via linking to DBpedia or other LOD sources.
- Data analysis, which will be performed on demand by pre-defined queries in the dashboard.
Other modules:
- Flume for dataset ingestion: for every source that will be ingested into the system, there will be a Flume agent responsible for data ingestion and basic modification/unification.
- Kafka: as soon as a new record is available, a Kafka message will be produced. One Kafka consumer stores raw data into HDFS.
- A set of pre-defined SPARQL queries that carry out analytical aggregations, important comparisons, and/or other analysis of the data (an illustrative query is sketched below).
- A GUI that provides functionality for (a) metadata search, to discover datasets, data, and publications, and (b) linked data browsing (i.e., dereferencing entity descriptions in RDF), in the form of a visual dashboard realised in d3.js15.
- GraphSearch as the user interface.
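One of the pre-defined analytical queries could look like the following Python sketch, which sums executed amounts per municipality via SPARQLWrapper. The endpoint URL and the property names reuse the hypothetical vocabulary from the earlier Data Cube sketch; the pilot's actual 4store endpoint and data model may differ.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Endpoint URL and property names are placeholders for the pilot's setup.
sparql = SPARQLWrapper("http://4store:8080/sparql/")
sparql.setQuery("""
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX ex: <http://example.org/budget/>
SELECT ?municipality (SUM(?amount) AS ?total)
WHERE {
  ?obs a qb:Observation ;
       ex:municipality ?municipality ;
       ex:executedAmount ?amount .
}
GROUP BY ?municipality
""")
sparql.setReturnFormat(JSON)  # satisfies the JSON-over-SPARQL output requirement
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["municipality"]["value"], row["total"]["value"])
```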
7.4 Deployment
Table 12 lists the components provided to the pilot as part of BDI, and components that will be developed within WP6 in the context of executing the pilot.
Table 12: Components needed to deploy the Second SC6 Pilot

| Module | Task | Responsible |
| --- | --- | --- |
| Spark over HDFS, 4store, Flume, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, SWC |
| Data storage schema | To be extended for the pilot | SWC |
| Metadata extraction | Parsers for different data sources will be developed for the pilot | SWC |
| GraphSearch GUI | To be configured for the pilot | SWC |

15 Cf. https://d3js.org
8 Second SC7 Pilot Deployment
8.1 Use cases
The pilot is carried out by SatCen, UoA, and NCSR-D in the frame of SC7: Secure societies - Protecting freedom and security of Europe and its citizens.
The pilot demonstrates the following workflows:
1. Event detection workflow: News sites and social media are monitored and processed in order to extract and localize information about events. Events are categorized and the information from them is extracted; the end-user is notified about the area concerned by the news and can visualize the event information together with the changes detected by the other workflow (if activated).
2. Change detection workflow: The end user selects a relevant Area of Interest. With respect to the selected dates, two satellite images (earliest and latest) of these areas are downloaded from the ESA Sentinels Scientific Data Hub and processed in order to detect changes. The end-user is notified about detected changes and can view the images and event information about this area.
The second cycle of the SC7 pilot will extend the functionality and improve the performance of the first cycle of the pilot (cf. D5.2, Section 8). Apart from the datasets used in the first cycle of the pilot, this cycle will also use the keyword-based Twitter API, to retrieve tweets based on pre-defined keywords. To further support the keyword-based search, the second cycle of the pilot will also include a full-text indexing engine.
The following outputs are made available for visualization or further processing:
- Relevant news related to specific keywords, together with the corresponding Area of Interest
- Detected changes
Moreover, the event detection workflow will be extended in order to automatically activate the change detection workflow. These changes are depicted in the updated architecture diagram in Figure 7.
8.2 Requirements
Table 13 lists the ingestion, storage, processing, and output requirements set by the second cycle of the pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 13 lists only the new requirements.
Table 13: Requirements of the Second SC7 Pilot

| Requirement | Comment |
| --- | --- |
| R1: Monitor keyword-based text services (Twitter). Text is retrieved and stored together with provenance and any metadata provided by the service (notably location) | The NOMAD data connectors to Twitter and Reuters will be adapted to access the keyword search API of Twitter and store to Cassandra |
| R2: Regularly execute event detection using Spark over the most recent text batch | Event detection is part of the ingestion process and adds annotations to the text data; it is not part of the distributed processing |
| R3: Improve the speed of the change detection workflow | Optimize the scalability of the operators developed in Apache Spark for the change detection workflow |
| R4: Extend the change detection workflow to improve accuracy | Fundamental SNAP operators (e.g., Subset and Terrain Correction) for Sentinel-1 will be adapted to Apache Spark |
| R5: Areas of Interest are automatically defined by event detection | The Sentinel data connector is parametrized from the event detection module with a GIS shape |
| R6: End-user interface is based on Sextant | Improvement of Sextant functionalities to improve the user experience |
| R7: Users must be authenticated and authorized to access the pilot data | Sextant will be extended in order to support authentication and authorization |
Figure 7: Architecture of the Second SC7 Pilot
8.3 Architecture
To satisfy the requirements above, the following modules will be deployed.
Storage infrastructures:
- HDFS for storing satellite images
- Cassandra for storing news and tweet content and metadata
- Lucene for storing the GADM dataset, i.e., the administrative areas together with their geo-locations
- Strabon for storing geo-locations of detected changes and location metadata about news and tweets
Processing infrastructures:
- Spark will be made available for improving the change detection module and developing the event detection module
Data integration:
- Semagrow will federate Strabon and Cassandra, to provide the user interface with homogeneous access to both data stores
Other modules:
- Twitter data connector (a hedged sketch is given below)
- Reuters RSS feed reader
- The Sentinel Data Aggregator, which receives as input the set of areas of interest and submits a suitable query to the Sentinels Scientific Data Hub
- Sextant as the user interface
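A keyword-based Twitter connector storing to Cassandra could look roughly like the following Python sketch. The choice of tweepy (in its pre-v4 API), the credentials, the keyspace, the table layout, and the keyword list are all assumptions for illustration; the pilot adapts the existing NOMAD connectors rather than this code.

```python
import tweepy                           # library choice is an assumption
from cassandra.cluster import Cluster   # DataStax cassandra-driver

# Credentials, keyspace, and table layout are placeholders.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

session = Cluster(["cassandra"]).connect("sc7")
insert = session.prepare(
    "INSERT INTO tweets_by_keyword (keyword, id, created_at, text, geo)"
    " VALUES (?, ?, ?, ?, ?)")

for keyword in ["flood", "wildfire"]:    # pre-defined keywords
    for tweet in api.search(q=keyword, count=100):
        # Store the text with provenance and whatever location metadata is given.
        session.execute(insert, (keyword, tweet.id, tweet.created_at,
                                 tweet.text, str(tweet.geo)))
```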
8.4 Deployment
Table 14 lists the components provided to the pilot as part of BDI16, and components that will be developed within WP6 in the context of executing the pilot.
Table 14: Components needed to deploy the Second SC7 Pilot

| Module | Task | Responsible |
| --- | --- | --- |
| Big Data Integrator: HDFS/Hadoop, Cassandra, Spark, Semagrow, Strabon, SOLR | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, UoA, SWC |
| Cassandra and Strabon stores | The schema needs to be altered to support tweets by keyword | NCSR-D and UoA |
| Change detection module | Spark code to be developed for extending and improving the change detection algorithm | UoA |
| Event detection module | Spark code to be developed to scale the event detection algorithm | NCSR-D |
| Twitter data connector | To be extended to access the keyword search Twitter API | NCSR-D |
| User interface | To be enhanced for the pilot | UoA |

16 Cf. https://github.com/big-data-europe/README/wiki/Components
9 Conclusions
This report analysed the pilot requirements and specified the components of the generic Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting round. The relevant work in this task is to ensure that the components are within the scope of what is prepared in WP4, and that they interoperate and can be used in the same application.
All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure and provided to the piloting partners as a basis for their piloting applications, which will be developed in WP6. As a result of this preliminary testing and of the interaction between the technical partners and the piloting partners, some of the original pilot descriptions have been refined and fully specified, and their usage of BDI components has been clarified. This ensures that the pilot descriptions are consistent with the first public release of the BDI platform (D4.2) and can be reproduced by interested third parties.
Work in this task (Task 5.2) will proceed as follows:
- During the second pilot deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.
- During the third pilot deployment phase, work in this task will prepare the next version of this document, regarding the BDI instances needed for the third piloting round.
D54 ndash v 100
Page
15
User-generated data such as geotagged photos from leaves young shoots and grape
clusters ampelographic data SSR-marker data that will be provided by the VITIS
application
OIV Descriptor List2 for Grape Varieties and Vitis species
Crop Ontology
The following processing is carried out
Named entity extraction
Researcher affiliation extraction and verification
Variety identification
Phenologic modelling
PDF structure processing to associate tables and diagrams with captions
Data ingestion and homogenization
Aggregation analysis correlation over scientific data
The following outputs are made available for visualization or further processing
Structured information topics extracted from scientific publications
Metadata for dataset searching and discovery
Aggregation analysis correlation results
32 Requirements
Table 3 lists the ingestion storage processing and output requirements set by this pilot
Table 3 Requirements of the Second SC2 Pilot
Requirement Comment
R1 In case of failures during data
processing users should be able to
recover data produced earlier in the
workflow along with their lineage and
associated metadata
Processing modules should periodically
store intermediate results and their lineage
metadata When starting up processing
modules should check at the metadata
registry if intermediate results are available
R2 Extracting images and their captions
from scientific publications
To be developed for the pilot taking into
account R1
2 httpwwwoivinten
D54 ndash v 100
Page
16
R3 Extracting thematic annotations from
text in scientific publications
To be developed for the pilot taking into
account R1
R4 Extracting researcher affiliations from
the scientific publications
To be developed for the pilot taking into
account R1
R5 Variety identification To be developed for the pilot taking into
account R1
R6 Phenolic modeling To be developed for the pilot taking into
account R1
R5 Expose data and metadata in JSON
through a Web API
Data ingestion module should write JSON
documents in HDFS 4store should be
accessed via a SPARQL endpoint that
responds with results in JSON
Table 3 Requirements of the Second SC2 Pilot
D54 ndash v 100
Page
17
Figure 2 Architecture of the Second SC2 Pilot
Figure 2 Architecture of the Second SC2 Pilot
33 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS for storing publication full-text and ingested datasets
A graph database for storing publication metadata (terms and named entities)
affiliation metadata (connections between researchers) weather metadata and VITIS
metadata
Processing infrastructures
Metadata extraction Spark or UnifiedViews3 are used to extract RDF metadata from
publication full-text These tools will react on Kafka messages Spark and UnifiedViews
will be evaluated for this task
3 Cf httpwwwunifiedviewseu
D54 ndash v 100
Page
18
PoolParty A SKOS Thesaurus4 will be used to consolidatetranslate (linkmap) the
terms in the ingested documents (eg bio terms locations and other named entities)
For this step the SWC PoolParty Semantic Suite5 will be used Additional enrichment
of the dataset will be explored eg via linking to DBpedia or other LOD sources
AKSTEM the process of discovering relations and associations between organizations
and people in the field of viticulture research
Phenolic Modeling algorithm already developed in AK VITIS will be adapted to work in
the context of an Apache Spark application
Variety Identification already developed in AK VITIS will be adapted to work in the
context of an Apache Spark application
Extraction of images and figures and their captions from publication PDFs
Data analysis which writes analysis results back into the infrastructure to be retrieved
for visualization Data analysis should accompany each write-back with appropriate
metadata that specify the processing lineage of the derived dataset Intermediate
results should also be written out (and described as such in the metadata) in order to
allow resuming processing after a failure
Other modules
Flume for publication ingestion For every source that will be ingested into the system
there will be a flume agent responsible for data ingestion and basic
modificationunification
Kafka as soon as a new record is available a Kafka message will be produced One
kafka consumer stores raw data into HDFS
34 Deployment
Table 4 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 4 Components needed to deploy the Second SC2 Pilot
Module Task Responsible
Spark over HDFS Flume
Kafka
BDI dockers made available by WP4 FH TF InfAI
SWC
4 Cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 5 Cf httpwwwpoolpartybiz
D54 ndash v 100
Page
19
GraphDB andor Neo4j
dockerization
To be investigated if the Docker
images provided by the official
systems6 are suitable for the pilot If
not will be altered for the pilot or use
an already dockerized triple store such
as Virtuoso or 4store
SWC
Flume agents for publication
ingestion and processing
To be developed for the pilot SWC
Flume agents for data
ingestion
To be extended for the pilot in order to
support the introduced datasets
(accuweather data user-generated
data)
SWC AK
Data storage schema To be developed for the pilot SWC AK
Phenolic modelling To be adapted from AK VITIS for the
pilot
AK
Spark AKSTEM To be adapted from AK STEM for the
pilot
AK
Variety Identification To be adapted from AK VITIS for the
pilot
AK
Table 4 Components needed to deploy the Second SC2 Pilot
6 httpsneo4jcomdeveloperdocker
D54 ndash v 100
Page
20
4 Second SC3 Pilot Deployment
41 Overview
The pilot is carried out by CRES in the frame of SC3 Secure Clean and Efficient Energy
The second pilot cycle extends the first pilot by adding additional online and offline data
analysis on raw data regarding Acoustic Emissions (AE) sensors and aggregated data such
as parametrics from continuous monitoring systems (CMS) The pilot demonstrates the
following workflow a developer in the field of wind energy enhances condition monitoring for
each unit in a wind farm by pooling together data from multiple units from the same farm (to
consider the cluster operation in total) and third party data (to perform correlated assessment)
The custom analysis modules created by the developer use both raw data that are transferred
offline to the processing cluster and condensed data streamed online at the same time order
that the event occurs
The following datasets are involved
Raw sensor and SCADA data from a given wind farm
Online stream data comprised of parametrics and statistics extracted from the raw
SCADA data
Raw sensor data from Acoustic Emissions module from a given wind farm
All data is in custom binary or ASCII formats ASCII files contain a metadata header and in
tabulated form the signal data (signal in columns time sequence in rows) All data is annotated
by location time and system id
The following processing is carried out
Near-real time execution of parametrized models to return operational statistics
warnings including correlation analysis of data across units
Weekly execution of operational statistics
Weekly execution of model parametrization
Weekly specific acoustic emissions DSP
The following outputs are made available for visualization or further processing
Operational statistics near-real time and weekly
Model parameters
D54 ndash v 100
Page
21
42 Requirements
Table 5 lists the ingestion storage processing and output requirements set by this pilot Since
the second cycle of the pilot extends the first pilot some requirements are identical and
therefore omitted from Table 5
Table 5 Requirements of Second SC3 Pilot
Requirement Comment
R1 The online data will be sent (via
OPC) from the intermediate
(local) processing level to BDI
A data connector must be developed that provides
for receiving OPC streams from an OPC-
compatible server
R2 The application should be able
to recover from short outages by
collecting the data transmitted
during the outage from the data
sources
An OPC data connector must be developed that
can retrieve the missing data collected at the
intermediate level from the distributed data
historian systems
R3 Near-realtime execution of
parametrized models to return
operational statistics including
correlation analysis of data
across units
The analysis software should write its results back
into a specified format and data model that is
appropriate input for further analysis
R4 The GUI supports database
querying and data visualization
for the analytics results
The GUI will be able to access files in the format
and data model
Table 5 Requirements of the Second SC3 Pilot
D54 ndash v 100
Page
22
Figure 3 Architecture of the Second SC3 Pilot
Figure 3 Architecture of the Second SC3 Pilot
43 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS that stores binary blobs each holding a temporal slice of the complete data The
slicing parameters are fixed and can be applied at data ingestion time
A Postgres relational database to store the warnings operational statistics and the
output of the analysis The schema will be defined in a later
A Kafka broker that will distribute the continuous stream of CMS to model execution
Processing infrastructures
D54 ndash v 100
Page
23
A processor that operates upon temporal slices of data
A Spark module that orchestrates the application of the processor on slices
A Spark streaming module that operates on the online data
Other modules
A data connector that offers an ingestion endpoint andor can retrieve from remote data
sources using the FTP protocol
A data connector that offers an ingestion endpoint that can retrieve an online stream
using OPC protocol and publish it to a Kafka topic
Data visualization that can visualize the data files stored in HDFS
44 Deployment
Table 6 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 6 Components needed to deploy the Second SC3 Pilot
Module Task Responsible
Spark HDFS Postgres
Kafka
BDI dockers made available by WP4 FH TF InfAI
NCSR-D SWC
Acoustic Emissions DSP To be developed for the pilot CRES
OPC Data connector To be developed for the pilot CRES
Data visualization To be extended for the pilot CRES
Table 6 Components needed to deploy the Second SC3 Pilot
D54 ndash v 100
Page
24
5 Second SC4 Pilot Deployment
51 Use cases
The pilot is carried out by FhG and CERTH in the frame of SC4 Smart Green and Integrated
Transport
The pilot demonstrates how to implement the workflow for ingesting processing and storing
stream and historical traffic data in a distributed environment The pilot demonstrates the
following workflows
The map matching of the Floating Car Data (FCD) stream that is generated by the taxi
fleet The FCD data that represents the position of cabs using latitude and longitude
coordinates must be map matched to the roads on which the cabs are driving in order
to infer the traffic conditions of the roads The map matching is done through an
algorithm using a geographical database and topological rules
The monitoring of the current traffic conditions that consumes the mapped FCD data
and infers the traffic conditions of the roads
The forecasting of future traffic conditions based on a model that is trained from
historical and real-time mapped FCD data
The second pilot is based upon the processing modules developed in the first pilot (cf D52
Section 5) namely the processing modules developed by CERTH to analyze traffic data
classify traffic conditions The second pilot will also develop the newly added workflow of the
traffic forecasting and model training that did not exist during the first pilot cycle
The data sources available for the pilot are
A near-real time stream Floating Car Data (FCD) generated by a fleet of 1200 taxis
containing information about the position speed and direction of the cabs
A historical database of recorded FCD data
A geographical database with information about the road network in Thessaloniki
The results of traffic monitoring and traffic forecasting are saved into a database for querying
statistics and visualizations
52 Requirements
Table 7 lists the ingestion storage processing and output requirements set by this pilot Since
the present pilot cycle is an extension of the first pilot the requirements of the first pilot also
apply Table 13 lists only the new requirements
D54 ndash v 100
Page
25
Table 7 Requirements of the Second SC4 Pilot
Requirement Comment
R1 The pilot will enable the
evaluation of the present and
future traffic conditions (eg
congestion) within temporal
windows
The FCD map matched data are used to determine
the current traffic condition and to make predictions
within different time windows
R2 The traffic predictions will be
saved in a database
Traffic condition and prediction will be used for
queries statistics evaluation of the quality of
predictions visualizations
R3 The pilot can be started in two
configurations single node (for
development and testing) and
cluster (production)
It must be possible to run all the pilot components
in one single node for development and testing
purposes The cluster configuration must provide
cluster of any components messaging system
(Kafka) processing modules (Flink Spark
TensorFlow) storage (Postgres)
Table 7 Requirements of the Second SC4 Pilot
D54 ndash v 100
Page
26
Figure 4 Architecture of the Second SC4 Pilot
Figure 4 Architecture of the Second SC4 Pilot
53 Architecture
The architecture of the pilot has been designed taking into consideration the data sources
mostly streams the processing steps needed and the information that needs to be computed
The pilot will ingest data from a near real-time FCD data stream from cabs and from historical
FCD data The FCD data needs to be preprocessed for map matching before being used for
classificationprediction
Apache Kafka will be used to distribute the computations as it provides a scalable fault
tolerant messaging system The processing of the data streams will be performed within
temporal windows Apache Flink will be used for the map matching algorithm in the same
manner as in the first cycle of the pilot Apache Spark or Tensorflow will be considered as a
platform to implement the traffic forecasting algorithm
The algorithms used for the map matching and classification will be provided using R as
it provides a good support for machine learning algorithms and because it is commonly used
D54 ndash v 100
Page
27
and well known by researchers at CERTH In order to use the R packages in a Flink application
developed in Java the pilot will connect to R server (via Rserve) Recurrent Neural Networks
will be used for the traffic forecasting module
The traffic conditions and prediction computation will be stored in a scalable fault tolerant
database such as Elasticsearch The storage system must support spatial and temporal
indexing
54 Deployment
Table 8 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 8 Components needed to deploy Second SC4 Pilot
Module Task Responsible
PostGIS Elasticsearch
Kafka Flink Spark
TensorFlow
BDI dockers made available by WP4 NCSR-D SWC
TF FhG
A Kafka producer for FCD
data stream (source URL)
and historical data (source
file system)
Develop a Kafka producer to collect
the FCD data as a stream from web
services and from the file system for
the historical data sets and send them
to a Kafka topic
FhG
Kafka brokers Install Kafka to provide a message
broker and the topics
SWC
A Spark application for traffic
forecasting and model
training
Develop a Spark application that
consumes FCD matched data from a
Kafka topic The application will train a
prediction model and write the traffic
predictions to ElasticSearch
FhG
D54 ndash v 100
Page
28
A Kafka consumer for storing
analysis results
Develop a Kafka consumer that stores
the result of the Traffic Classification
and prediction module
FhG
Table 8 Components needed to deploy the Second SC4 Pilot
6 Second SC5 Pilot Deployment
61 Use cases
The pilot is carried out by NCSR-D in the frame of SC5 Climate Action Environment Resource
Efficiency and Raw Materials
The pilot demonstrates the following workflow A (potentially hazardous) substance is released
in the atmosphere that results to increased readings in one or more monitoring stations The
user accesses a user interface provided by the pilot to define the locations of the monitoring
stations as well as a timeseries of the measured values (eg gamma dose rate) The platform
initiates
a weather matching algorithm that is a search for similarity of the current weather and
the pre-computed weather patterns as well as
a dispersion matching algorithm that is a search for similarity of the current substance
dispersion patterns with the precomputed ones
The weather patterns have been extracted in a pre-processing step by clustering weather
conditions recorded in the past while the substance dispersion patterns have been
precomputed by simulating different scenarios of substance release and weather conditions
The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request
The following datasets are involved
NetCDF files from the European Centre for Medium range Weather Forecasting
(ECMWF7)
GRIB files from National Oceanic and Atmospheric Administration (NOAA8)
The following processing will be carried out
The weather clustering algorithm that creates clusters of similar weather conditions
implemented using the BDI platform (see Section 63)
7 httpappsecmwfintdatasets 8 httpswwwncdcnoaagovdata-accessmodel-datamodel-datasetsglobal-forcast-system-gfs
D54 ndash v 100
Page
29
The WRF downscaling that takes as input a low resolution weather and creates a high
resolution weather
The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model
computes dispersion patterns given predominant weather conditions
The following outputs are made available for visualization or further processing
The dispersions produced by DIPCOT
The weather clusters produced by the weather clustering algorithm
62 Requirements
Table 9 lists the ingestion storage processing and output requirements set by this pilot
Table 9 Requirements of Second SC5 Pilot
Requirement Comment
R1 Provide a means of downloading
currentevaluation weather from
ECMWF or alternative services
Data connectorinterface needs to be developed
R2 ECMWF and NOAA datasets are
compatible with the WRF and
DIPCOT naming conventions
A preprocessing WPS normalization step will
perform the necessary transformations and
variable renamings needs to ensure compatibility
R3 Retrieve NetCDF files from HDFS
as input to the weather clustering
algorithm
R4 Dispersion matching will filter on
dispersion values
Relational database will provide indexes on
dispersion values for efficient dispersion search
R5 Dispersion visualization Weather and dispersion matching must produce
output compatible with Sextantrsquos input or Sextant
must be modified to support new input
Table 9 Requirements of the Second SC5 Pilot
R3 Extracting thematic annotations from text in scientific publications | To be developed for the pilot, taking into account R1
R4 Extracting researcher affiliations from the scientific publications | To be developed for the pilot, taking into account R1
R5 Variety identification | To be developed for the pilot, taking into account R1
R6 Phenolic modeling | To be developed for the pilot, taking into account R1
R7 Expose data and metadata in JSON through a Web API | The data ingestion module should write JSON documents in HDFS; 4store should be accessed via a SPARQL endpoint that responds with results in JSON
Table 3: Requirements of the Second SC2 Pilot
Figure 2: Architecture of the Second SC2 Pilot
3.3 Architecture
To satisfy the requirements above, the following modules will be deployed:
Storage infrastructures:
- HDFS for storing publication full-text and ingested datasets
- A graph database for storing publication metadata (terms and named entities), affiliation metadata (connections between researchers), weather metadata and VITIS metadata
Processing infrastructures:
- Metadata extraction: Spark or UnifiedViews3 are used to extract RDF metadata from publication full-text. These tools will react on Kafka messages; Spark and UnifiedViews will be evaluated for this task.
3 Cf. http://www.unifiedviews.eu
- PoolParty: A SKOS Thesaurus4 will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g., bio terms, locations and other named entities). For this step the SWC PoolParty Semantic Suite5 will be used. Additional enrichment of the dataset will be explored, e.g., via linking to DBpedia or other LOD sources.
- AKSTEM: the process of discovering relations and associations between organizations and people in the field of viticulture research
- Phenolic Modeling: an algorithm already developed in AK VITIS will be adapted to work in the context of an Apache Spark application
- Variety Identification: already developed in AK VITIS, to be adapted to work in the context of an Apache Spark application
- Extraction of images and figures, and their captions, from publication PDFs
- Data analysis, which writes analysis results back into the infrastructure to be retrieved for visualization. Data analysis should accompany each write-back with appropriate metadata that specifies the processing lineage of the derived dataset. Intermediate results should also be written out (and described as such in the metadata) in order to allow resuming processing after a failure.
Other modules:
- Flume for publication ingestion. For every source that will be ingested into the system there will be a Flume agent responsible for data ingestion and basic modification/unification.
- Kafka: as soon as a new record is available, a Kafka message will be produced. One Kafka consumer stores raw data into HDFS. A sketch of how Spark can react on these Kafka messages is given below.
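As an illustration of the "react on Kafka messages" pattern, the following is a minimal Spark Streaming sketch; the topic name, broker address and the extract_rdf_metadata() helper are hypothetical, and the spark-streaming-kafka integration package is assumed to be available.

# Sketch: Spark Streaming job that reacts on Kafka messages announcing newly
# ingested publications and runs metadata extraction on each micro-batch.
# Topic, broker address and extract_rdf_metadata() are hypothetical.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils   # spark-streaming-kafka package

def extract_rdf_metadata(message):
    # placeholder: parse the announcement and return RDF triples for the document
    return []

sc = SparkContext(appName="sc2-metadata-extraction")
ssc = StreamingContext(sc, 60)                   # one micro-batch per minute
stream = KafkaUtils.createDirectStream(
    ssc, ["publications"], {"metadata.broker.list": "kafka:9092"})
triples = stream.flatMap(lambda kv: extract_rdf_metadata(kv[1]))
triples.pprint()   # in the pilot, triples would be loaded into the graph database
ssc.start()
ssc.awaitTermination()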
3.4 Deployment
Table 4 lists the components provided to the pilot as part of BDI, and components that will be developed within WP6 in the context of executing the pilot.
Table 4: Components needed to deploy the Second SC2 Pilot
Module | Task | Responsible
Spark over HDFS, Flume, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, SWC
GraphDB and/or Neo4j dockerization | To be investigated whether the Docker images provided by the official systems6 are suitable for the pilot; if not, they will be altered for the pilot, or an already dockerized triple store such as Virtuoso or 4store will be used | SWC
Flume agents for publication ingestion and processing | To be developed for the pilot | SWC
Flume agents for data ingestion | To be extended for the pilot in order to support the introduced datasets (AccuWeather data, user-generated data) | SWC, AK
Data storage schema | To be developed for the pilot | SWC, AK
Phenolic modelling | To be adapted from AK VITIS for the pilot | AK
Spark AKSTEM | To be adapted from AK STEM for the pilot | AK
Variety Identification | To be adapted from AK VITIS for the pilot | AK
4 Cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System
5 Cf. http://www.poolparty.biz
6 Cf. https://neo4j.com/developer/docker
4 Second SC3 Pilot Deployment
4.1 Overview
The pilot is carried out by CRES in the frame of SC3, Secure, Clean and Efficient Energy.
The second pilot cycle extends the first pilot by adding additional online and offline data analysis on raw data regarding Acoustic Emissions (AE) sensors, and on aggregated data such as parametrics from continuous monitoring systems (CMS). The pilot demonstrates the following workflow: a developer in the field of wind energy enhances condition monitoring for each unit in a wind farm by pooling together data from multiple units from the same farm (to consider the cluster operation in total) and third-party data (to perform correlated assessment). The custom analysis modules created by the developer use both raw data that are transferred offline to the processing cluster and condensed data streamed online in the same time order in which the events occur.
The following datasets are involved:
- Raw sensor and SCADA data from a given wind farm
- Online stream data comprised of parametrics and statistics extracted from the raw SCADA data
- Raw sensor data from the Acoustic Emissions module from a given wind farm
All data is in custom binary or ASCII formats. ASCII files contain a metadata header and, in tabulated form, the signal data (signals in columns, time sequence in rows). All data is annotated by location, time and system id; a sketch of parsing such a file is given after the lists below.
The following processing is carried out:
- Near-real-time execution of parametrized models to return operational statistics and warnings, including correlation analysis of data across units
- Weekly execution of operational statistics
- Weekly execution of model parametrization
- Weekly specific acoustic emissions DSP
The following outputs are made available for visualization or further processing:
- Operational statistics, near-real-time and weekly
- Model parameters
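As an illustration only, the following sketch parses one such ASCII file, assuming the metadata header lines are prefixed with "#" and the signal table is whitespace-separated; the actual CRES file layout may differ.

# Sketch: reading an AE/SCADA ASCII file with a metadata header followed by
# tabulated signal data (signals in columns, time sequence in rows).
# The "#" header prefix and whitespace separation are assumptions.
import numpy as np

def read_ascii_slice(path):
    header = {}
    with open(path) as f:
        for line in f:
            if not line.startswith("#"):
                break
            key, _, value = line[1:].partition(":")
            header[key.strip()] = value.strip()   # e.g. location, time, system id
    signals = np.loadtxt(path, comments="#")      # rows: time steps, cols: signals
    return header, signals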
4.2 Requirements
Table 5 lists the ingestion, storage, processing and output requirements set by this pilot. Since the second cycle of the pilot extends the first pilot, some requirements are identical and therefore omitted from Table 5.
Table 5: Requirements of the Second SC3 Pilot
Requirement | Comment
R1 The online data will be sent (via OPC) from the intermediate (local) processing level to BDI | A data connector must be developed that provides for receiving OPC streams from an OPC-compatible server
R2 The application should be able to recover from short outages by collecting the data transmitted during the outage from the data sources | An OPC data connector must be developed that can retrieve the missing data collected at the intermediate level from the distributed data historian systems
R3 Near-real-time execution of parametrized models to return operational statistics, including correlation analysis of data across units | The analysis software should write its results back in a specified format and data model that is appropriate input for further analysis
R4 The GUI supports database querying and data visualization for the analytics results | The GUI will be able to access files in the specified format and data model
Figure 3: Architecture of the Second SC3 Pilot
4.3 Architecture
To satisfy the requirements above, the following modules will be deployed:
Storage infrastructures:
- HDFS, which stores binary blobs, each holding a temporal slice of the complete data. The slicing parameters are fixed and can be applied at data ingestion time.
- A Postgres relational database to store the warnings, operational statistics and the output of the analysis. The schema will be defined at a later stage.
- A Kafka broker that will distribute the continuous stream of CMS data to model execution
Processing infrastructures:
- A processor that operates upon temporal slices of data
- A Spark module that orchestrates the application of the processor on slices
- A Spark Streaming module that operates on the online data
Other modules:
- A data connector that offers an ingestion endpoint and/or can retrieve from remote data sources using the FTP protocol
- A data connector that offers an ingestion endpoint that can retrieve an online stream using the OPC protocol and publish it to a Kafka topic (see the sketch below)
- Data visualization that can visualize the data files stored in HDFS
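A minimal sketch of the OPC data connector is given below, assuming an OPC UA server reachable via the python-opcua library; the server URL, node id, topic name and polling approach are illustrative assumptions (the pilot's OPC flavour and access pattern may differ).

# Sketch: OPC data connector that polls an OPC UA server and publishes
# readings to a Kafka topic. Server URL, node id and topic are placeholders.
import json, time
from opcua import Client                 # python-opcua
from kafka import KafkaProducer          # kafka-python

opc = Client("opc.tcp://historian:4840")  # assumed OPC UA endpoint
opc.connect()
node = opc.get_node("ns=2;s=CMS.Parametrics")   # assumed node id
producer = KafkaProducer(bootstrap_servers="kafka:9092")

while True:
    reading = {"ts": time.time(), "value": node.get_value()}
    producer.send("cms-stream", json.dumps(reading).encode("utf-8"))
    time.sleep(1.0)   # polling for illustration; a real connector would
                      # rather use OPC subscriptions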
4.4 Deployment
Table 6 lists the components provided to the pilot as part of BDI, and components that will be developed within WP6 in the context of executing the pilot.
Table 6: Components needed to deploy the Second SC3 Pilot
Module | Task | Responsible
Spark, HDFS, Postgres, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, SWC
Acoustic Emissions DSP | To be developed for the pilot | CRES
OPC Data connector | To be developed for the pilot | CRES
Data visualization | To be extended for the pilot | CRES
5 Second SC4 Pilot Deployment
5.1 Use cases
The pilot is carried out by FhG and CERTH in the frame of SC4, Smart, Green and Integrated Transport.
The pilot demonstrates how to implement the workflow for ingesting, processing and storing stream and historical traffic data in a distributed environment. The pilot demonstrates the following workflows:
- The map matching of the Floating Car Data (FCD) stream that is generated by the taxi fleet. The FCD data, which represents the positions of cabs using latitude and longitude coordinates, must be map matched to the roads on which the cabs are driving in order to infer the traffic conditions of the roads. The map matching is done through an algorithm using a geographical database and topological rules.
- The monitoring of the current traffic conditions, which consumes the mapped FCD data and infers the traffic conditions of the roads.
- The forecasting of future traffic conditions, based on a model that is trained from historical and real-time mapped FCD data.
The second pilot is based upon the processing modules developed in the first pilot (cf. D5.2, Section 5), namely the processing modules developed by CERTH to analyze traffic data and classify traffic conditions. The second pilot will also develop the newly added workflow of traffic forecasting and model training that did not exist during the first pilot cycle.
The data sources available for the pilot are:
- A near-real-time stream of Floating Car Data (FCD) generated by a fleet of 1200 taxis, containing information about the position, speed and direction of the cabs
- A historical database of recorded FCD data
- A geographical database with information about the road network in Thessaloniki
The results of traffic monitoring and traffic forecasting are saved into a database for querying, statistics and visualizations.
5.2 Requirements
Table 7 lists the ingestion, storage, processing and output requirements set by this pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 7 lists only the new requirements.
Table 7: Requirements of the Second SC4 Pilot
Requirement | Comment
R1 The pilot will enable the evaluation of the present and future traffic conditions (e.g., congestion) within temporal windows | The map-matched FCD data are used to determine the current traffic condition and to make predictions within different time windows
R2 The traffic predictions will be saved in a database | Traffic conditions and predictions will be used for queries, statistics, evaluation of the quality of predictions, and visualizations
R3 The pilot can be started in two configurations: single node (for development and testing) and cluster (production) | It must be possible to run all the pilot components in one single node for development and testing purposes. The cluster configuration must provide clustering of all components: messaging system (Kafka), processing modules (Flink, Spark, TensorFlow), storage (Postgres)
Figure 4: Architecture of the Second SC4 Pilot
5.3 Architecture
The architecture of the pilot has been designed taking into consideration the data sources (mostly streams), the processing steps needed, and the information that needs to be computed. The pilot will ingest data from a near-real-time FCD data stream from cabs and from historical FCD data. The FCD data needs to be preprocessed for map matching before being used for classification/prediction.
Apache Kafka will be used to distribute the computations, as it provides a scalable, fault-tolerant messaging system. The processing of the data streams will be performed within temporal windows. Apache Flink will be used for the map matching algorithm, in the same manner as in the first cycle of the pilot. Apache Spark or TensorFlow will be considered as a platform to implement the traffic forecasting algorithm.
The algorithms used for the map matching and classification will be provided using R, as it provides good support for machine learning algorithms and because it is commonly used
and well known by researchers at CERTH. In order to use the R packages in a Flink application developed in Java, the pilot will connect to an R server (via Rserve); a sketch of such a round-trip is given below. Recurrent Neural Networks will be used for the traffic forecasting module.
The traffic conditions and prediction computations will be stored in a scalable, fault-tolerant database such as Elasticsearch. The storage system must support spatial and temporal indexing.
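For illustration, the sketch below shows the Rserve round-trip from Python rather than Java (the pilot itself will call Rserve from the Flink Java application); the R script and the map_match() function name are placeholders.

# Sketch: calling an R map-matching/classification routine through Rserve.
# The script name and map_match() are illustrative placeholders.
import pyRserve

conn = pyRserve.connect(host="rserve", port=6311)
conn.voidEval('source("map_matching.R")')        # assumed R script
# Hand one FCD record (lat, lon, speed, heading) to R; get a road id back.
road_id = conn.eval("map_match(40.6401, 22.9444, 35.0, 180.0)")
conn.close()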
5.4 Deployment
Table 8 lists the components provided to the pilot as part of BDI, and components that will be developed within WP6 in the context of executing the pilot.
Table 8: Components needed to deploy the Second SC4 Pilot
Module | Task | Responsible
PostGIS, Elasticsearch, Kafka, Flink, Spark, TensorFlow | BDI dockers made available by WP4 | NCSR-D, SWC, TF, FhG
A Kafka producer for the FCD data stream (source: URL) and historical data (source: file system) | Develop a Kafka producer to collect the FCD data as a stream from web services and from the file system for the historical data sets, and send them to a Kafka topic | FhG
Kafka brokers | Install Kafka to provide a message broker and the topics | SWC
A Spark application for traffic forecasting and model training | Develop a Spark application that consumes map-matched FCD data from a Kafka topic. The application will train a prediction model and write the traffic predictions to Elasticsearch | FhG
A Kafka consumer for storing analysis results | Develop a Kafka consumer that stores the results of the traffic classification and prediction module | FhG
6 Second SC5 Pilot Deployment
6.1 Use cases
The pilot is carried out by NCSR-D in the frame of SC5, Climate Action, Environment, Resource Efficiency and Raw Materials.
The pilot demonstrates the following workflow: A (potentially hazardous) substance is released in the atmosphere, resulting in increased readings in one or more monitoring stations. The user accesses a user interface provided by the pilot to define the locations of the monitoring stations, as well as a timeseries of the measured values (e.g., gamma dose rate). The platform initiates:
- a weather matching algorithm, that is, a search for similarity between the current weather and the pre-computed weather patterns, as well as
- a dispersion matching algorithm, that is, a search for similarity of the current substance dispersion patterns with the precomputed ones.
The weather patterns have been extracted in a pre-processing step by clustering weather conditions recorded in the past, while the substance dispersion patterns have been precomputed by simulating different scenarios of substance release and weather conditions. The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request. A sketch of the weather matching step is given below.
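As a minimal sketch, and under the assumption that weather snapshots are represented as fixed-length feature vectors (the same representation assumed for the clustering step in Section 6.3), weather matching amounts to a nearest-pattern search; the plain Euclidean distance is an illustrative choice.

# Sketch: matching the current weather against pre-computed weather patterns.
# Assumes fixed-length feature vectors; Euclidean distance is illustrative.
import numpy as np

def match_weather(current, pattern_centroids):
    """Return the index of the closest pre-computed weather pattern."""
    dists = np.linalg.norm(pattern_centroids - current, axis=1)
    return int(np.argmin(dists))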
The following datasets are involved:
- NetCDF files from the European Centre for Medium-range Weather Forecasting (ECMWF7)
- GRIB files from the National Oceanic and Atmospheric Administration (NOAA8)
The following processing will be carried out:
- The weather clustering algorithm, which creates clusters of similar weather conditions, implemented using the BDI platform (see Section 6.3)
7 Cf. http://apps.ecmwf.int/datasets
8 Cf. https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/global-forcast-system-gfs
- The WRF downscaling, which takes as input a low-resolution weather field and creates a high-resolution one
- The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model, which computes dispersion patterns given predominant weather conditions
The following outputs are made available for visualization or further processing:
- The dispersions produced by DIPCOT
- The weather clusters produced by the weather clustering algorithm
6.2 Requirements
Table 9 lists the ingestion, storage, processing and output requirements set by this pilot.
Table 9: Requirements of the Second SC5 Pilot
Requirement | Comment
R1 Provide a means of downloading current/evaluation weather from ECMWF or alternative services | A data connector/interface needs to be developed
R2 ECMWF and NOAA datasets are compatible with the WRF and DIPCOT naming conventions | A preprocessing WPS normalization step will perform the necessary transformations and variable renamings needed to ensure compatibility
R3 Retrieve NetCDF files from HDFS as input to the weather clustering algorithm |
R4 Dispersion matching will filter on dispersion values | A relational database will provide indexes on dispersion values for efficient dispersion search (see the sketch after the table)
R5 Dispersion visualization | Weather and dispersion matching must produce output compatible with Sextant's input, or Sextant must be modified to support new input
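To illustrate R4, a dispersion-matching filter against an indexed Postgres table might look as follows; the table and column names are placeholders, not the pilot's actual schema.

# Sketch: filtering pre-computed dispersions in Postgres by dispersion value
# at given monitoring stations. Table and column names are illustrative.
import psycopg2

conn = psycopg2.connect(host="postgres", dbname="dispersions", user="bde")
cur = conn.cursor()
# A B-tree index makes the range filter efficient, e.g.:
#   CREATE INDEX ON dispersion_values (station_id, value);
cur.execute("""
    SELECT scenario_id, station_id, value
    FROM dispersion_values
    WHERE station_id = %s AND value BETWEEN %s AND %s
""", ("station-42", 0.8, 1.2))
candidates = cur.fetchall()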
Figure 5: Architecture of the Second SC5 Pilot
6.3 Architecture
To satisfy the requirements described above, the following components will be deployed:
Storage infrastructure:
- HDFS for storing NetCDF and GRIB files
- Postgres for storing dispersions
Processing components:
- scikit-learn or TensorFlow to host the weather clustering algorithm (a sketch follows this list)
Other modules:
- ECMWF and NOAA data connectors
- WPS normalization procedure
- WRF downscaling component
- DIPCOT atmospheric dispersion model
- Weather and dispersion matching
- Sextant for visualizing the dispersion layer
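To make the clustering step concrete, the following is a minimal sketch of hosting the weather clustering algorithm on scikit-learn; the feature representation (each historical weather snapshot flattened into a fixed-length vector) and the number of clusters are illustrative assumptions, not part of the pilot specification.

# Minimal sketch: clustering historical weather snapshots with scikit-learn.
# Assumes each snapshot has been flattened into a fixed-length feature vector
# (e.g., wind components and temperature on a fixed grid).
import numpy as np
from sklearn.cluster import KMeans

def cluster_weather(snapshots, n_clusters=20):
    X = np.asarray(snapshots, dtype=np.float64)   # (n_samples, n_features)
    model = KMeans(n_clusters=n_clusters, random_state=0).fit(X)
    return model.cluster_centers_                 # the pre-computed weather patterns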
6.4 Deployment
Table 10 lists the components provided to the pilot as part of BDI, and components that will be developed within WP6 in the context of executing the pilot.
Table 10: Components needed to deploy the Second SC5 Pilot
Module | Task | Responsible
HDFS, Sextant, Postgres | BDI dockers made available by WP4 | TF, UoA, NCSR-D
Scikit-learn, TensorFlow | To be developed in the pilot | NCSR-D
DIPCOT | To be packaged in the pilot | NCSR-D
Weather clustering algorithm | To be developed in the pilot | NCSR-D
Weather matching | To be developed in the pilot | NCSR-D
Dispersion matching | To be developed in the pilot | NCSR-D
ECMWF and NOAA data connector | To be developed in the pilot | NCSR-D
Data visualization UI | To be developed in the pilot | NCSR-D
7 Second SC6 Pilot Deployment
7.1 Use cases
The pilot is carried out by NCSR-D and SWC in the frame of SC6, Europe in a changing world: inclusive, innovative and reflective societies.
The pilot demonstrates the following workflow: Municipality economic data (i.e., budget and budget execution data) are ingested on a regular basis (daily, weekly, and so on) from a series of locations, in a variety of structures and formats, and are homogenized so that they can be compared, analyzed and visualized in a comprehensible way. The data is exposed to users via a dashboard that offers search/discovery, aggregation, analysis, correlation and visualization functionalities over structured data. The results of the data analysis will be stored in the infrastructure to avoid carrying out the same processing multiple times.
The second cycle of the pilot will extend the first pilot by incorporating different formats, by developing a modular parsing library.
The following datasets are involved:
- Budget execution data of the Municipality of Athens
- Budget execution data of the Municipality of Thessaloniki
- Budget execution data of the Municipality of Barcelona
The datasets currently involved are exposed either as an API or as CSV/XML files. Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies. Statistical data will be described in the RDF Data Cube12 vocabulary.
The following processing is carried out:
- Data ingestion and homogenization
- Aggregation, analysis and correlation over the ingested data
The following outputs are made available for visualization or further processing:
- Structured information extracted from budget datasets, exposed as a SPARQL endpoint
- Metadata for dataset searching and discovery
9 Cf. https://joinup.ec.europa.eu/asset/dcat_application_profile/description
10 Cf. http://www.omg.org/spec/EDMC-FIBO/FND/1.0/Beta1/index.htm
11 Cf. http://www.omg.org/hot-topics/finance.htm
12 Cf. https://www.w3.org/TR/2014/REC-vocab-data-cube-20140116/
- Aggregation and analysis results (a sketch of consuming these outputs via the SPARQL endpoint follows)
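To illustrate how the outputs could be consumed, the sketch below queries the pilot's SPARQL endpoint for aggregated figures; the endpoint URL and the measure property are placeholders, not the pilot's actual data model.

# Sketch: querying the pilot's SPARQL endpoint for per-dataset totals.
# The endpoint URL and the measure property are illustrative placeholders.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://example.org/sparql")   # placeholder endpoint
endpoint.setQuery("""
    PREFIX qb: <http://purl.org/linked-data/cube#>
    SELECT ?dataset (SUM(?amount) AS ?total)
    WHERE {
      ?obs a qb:Observation ;
           qb:dataSet ?dataset ;
           <http://example.org/amount> ?amount .   # placeholder measure
    }
    GROUP BY ?dataset
""")
endpoint.setReturnFormat(JSON)
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["dataset"]["value"], row["total"]["value"])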
7.2 Requirements
Table 11 lists the ingestion, storage, processing and output requirements set by this pilot.
Table 11: Requirements of the Second SC6 Pilot
Requirement | Comment
R1 In case of failures during data processing, users should be able to recover data produced earlier in the workflow, along with their lineage and associated metadata | Processing modules should periodically store intermediate results. When starting up, processing modules should check at the metadata registry whether intermediate results are available
R2 Transform budget data into a homogenized format using various parsers | Parsers will be developed for the pilot, taking into account R1
R3 Expose data and metadata through a SPARQL endpoint | The triple store should be accessed via a SPARQL endpoint
R4 Intuitive, easy-to-use interface for searching and selecting relevant data sources. The use of the user interface should be documented so that users can ease into using it with as little effort as possible | The GraphSearch UI will be used to create visualizations from SPARQL queries
Figure 6: Architecture of the Second SC6 Pilot
7.3 Architecture
To satisfy the requirements above, the following modules will be deployed:
Storage infrastructures:
- HDFS for storing ingested datasets
- 4store for storing homogenized statistical data and dataset metadata
Processing infrastructures:
- Metadata extraction: Spark is used to extract RDF data and metadata from budget data. These tools will react on Kafka messages.
- PoolParty: A SKOS Thesaurus13 will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g., locations and other named entities). For this step the SWC PoolParty Semantic Suite14 will be used, as an external service.
13 Cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System
14 Cf. http://www.poolparty.biz
PoolParty is accessible from the BDE components via an HTTP API. The connection between Spark and PoolParty has been implemented in the first pilot cycle. Additional enrichment of the dataset will be explored, e.g., via linking to DBpedia or other LOD sources.
- Data analysis that will be performed on demand, by pre-defined queries in the dashboard
Other modules:
- Flume for dataset ingestion. For every source that will be ingested into the system there will be a Flume agent responsible for data ingestion and basic modification/unification.
- Kafka: as soon as a new record is available, a Kafka message will be produced. One Kafka consumer stores raw data into HDFS (a sketch of this consumer follows the list).
- A set of pre-defined SPARQL queries that carry out analytical aggregations, important comparisons, and/or other analysis of the data
- A GUI that provides functionality for (a) metadata searching to discover datasets, data and publications, and (b) linked data browsing (i.e., dereferencing entity descriptions in RDF), in the form of a visual dashboard realised in d3.js15
- GraphSearch as the user interface
15 Cf. https://d3js.org
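A minimal sketch of the Kafka consumer that archives raw records into HDFS is given below; the topic name, WebHDFS address and target path are assumptions for illustration only.

# Sketch: Kafka consumer that archives each raw record into HDFS.
# Topic name, WebHDFS URL and target path are illustrative assumptions.
from kafka import KafkaConsumer          # kafka-python
from hdfs import InsecureClient          # hdfs (WebHDFS client)

consumer = KafkaConsumer("budget-records",           # assumed topic
                         bootstrap_servers="kafka:9092")
hdfs = InsecureClient("http://namenode:50070", user="flume")

for msg in consumer:
    path = "/raw/budget/record-%d.json" % msg.offset
    hdfs.write(path, data=msg.value, overwrite=True)  # store the raw payload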
7.4 Deployment
Table 12 lists the components provided to the pilot as part of BDI, and components that will be developed within WP6 in the context of executing the pilot.
Table 12: Components needed to deploy the Second SC6 Pilot
Module | Task | Responsible
Spark over HDFS, 4store, Flume, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, SWC
Data storage schema | To be extended for the pilot | SWC
Metadata extraction | Parsers for different data sources will be developed for the pilot | SWC
GraphSearch GUI | To be configured for the pilot | SWC
8 Second SC7 Pilot Deployment
8.1 Use cases
The pilot is carried out by SatCen, UoA and NCSR-D in the frame of SC7, Secure societies: protecting freedom and security of Europe and its citizens.
The pilot demonstrates the following workflows:
1. Event detection workflow: News sites and social media are monitored and processed in order to extract and localize information about events. Events are categorized and the information from them is extracted; the end-user is notified about the area affected by the news and can visualize the event information together with the changes detected by the other workflow (if activated).
2. Change detection workflow: The end user selects a relevant Area of Interest. With respect to the selected dates, two satellite images (earliest and latest) of these areas are downloaded from the ESA Sentinels Scientific Data Hub and processed in order to detect changes. The end-user is notified about detected changes and can view the images and event information about this area.
The second cycle of the SC7 pilot will extend the functionality and improve the performance of the first cycle of the pilot (cf. D5.2, Section 8).
Apart from the datasets used in the first cycle of the pilot, this cycle will also use the keyword-based Twitter API to retrieve tweets based on pre-defined keywords. To further support the keyword-based search, the second cycle of the pilot will also include a full-text indexing engine.
The following outputs are made available for visualization or further processing:
- Relevant news related to specific keywords, together with the corresponding Area of Interest
- Detected changes
Moreover, the event detection workflow will be extended in order to automatically activate the change detection workflow. These changes are depicted in the updated architecture diagram in Figure 7. A sketch of querying the Data Hub for the image pairs is given below.
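For illustration, the following sketch queries the Sentinels Scientific Data Hub OpenSearch API for the products covering an Area of Interest between two dates; the credentials and the WKT footprint are placeholder assumptions.

# Sketch: querying the Sentinels Scientific Data Hub (OpenSearch API) for
# Sentinel-1 products over an Area of Interest. Credentials and the WKT
# footprint are illustrative placeholders.
import requests

def search_products(footprint_wkt, date_from, date_to):
    query = ('platformname:Sentinel-1 AND footprint:"Intersects(%s)" '
             'AND beginposition:[%s TO %s]') % (footprint_wkt, date_from, date_to)
    resp = requests.get("https://scihub.copernicus.eu/dhus/search",
                        params={"q": query, "rows": 100, "format": "json"},
                        auth=("user", "password"))   # placeholder credentials
    resp.raise_for_status()
    return resp.json()

# Usage: the earliest and latest products for the change detection workflow
# would be picked from the returned list.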
8.2 Requirements
Table 13 lists the ingestion, storage, processing and output requirements set by the second cycle of the pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 13 lists only the new requirements.
Table 13: Requirements of the Second SC7 Pilot
Requirement | Comment
R1 Monitor keyword-based text services (Twitter); text is retrieved and stored together with provenance and any metadata provided by the service (notably location) | The NOMAD data connectors to Twitter and Reuters will be adapted to access the keyword search API of Twitter and store to Cassandra
R2 Regularly execute event detection using Spark over the most recent text batch | Event detection is part of the ingestion process and adds annotations to the text data; it is not part of the distributed processing
R3 Improve the speed of the change detection workflow | Optimize the scalability of the operators developed in Apache Spark for the change detection workflow
R4 Extend the change detection workflow to improve accuracy | Fundamental SNAP operators (e.g., Subset and Terrain Correction) for Sentinel-1 will be adapted to Apache Spark
R5 Areas of Interest are automatically defined by event detection | The Sentinel data connector is parametrized from the event detection module with a GIS shape
R6 The end-user interface is based on Sextant | Improvement of Sextant functionalities to improve the user experience
R7 Users must be authenticated and authorized to access the pilot data | Sextant will be extended in order to support authentication and authorization
Figure 7: Architecture of the Second SC7 Pilot
8.3 Architecture
To satisfy the requirements above, the following modules will be deployed:
Storage infrastructures:
- HDFS for storing satellite images
- Cassandra for storing news and tweet content and metadata
- Lucene for storing the GADM dataset, i.e., the administrative areas together with their geo-locations
- Strabon for storing geo-locations of detected changes and location metadata about news and tweets
Processing infrastructures:
- Spark will be made available for improving the change detection module and developing the event detection module
Data integration:
- Semagrow will federate Strabon and Cassandra to provide the user interface with homogeneous access to both data stores
Other modules:
- Twitter data connector (a sketch follows this list)
- Reuters RSS feed reader
- The Sentinel Data Aggregator, which receives as input the set of areas of interest and submits a suitable query to the Sentinels Scientific Data Hub
- Sextant as the user interface
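The sketch below shows one way the Twitter data connector could poll the keyword search API and store tweets with their metadata into Cassandra; the keyspace, table schema and authentication details are assumptions for illustration only.

# Sketch: polling the Twitter keyword search API and storing tweets into
# Cassandra. Keyspace, table schema and the bearer token are placeholders.
import requests
from cassandra.cluster import Cluster

session = Cluster(["cassandra"]).connect("sc7")   # assumed keyspace "sc7"
insert = session.prepare(
    "INSERT INTO tweets (id, text, created_at, location) VALUES (?, ?, ?, ?)")

def ingest(keyword, bearer_token):
    resp = requests.get("https://api.twitter.com/1.1/search/tweets.json",
                        params={"q": keyword, "count": 100},
                        headers={"Authorization": "Bearer " + bearer_token})
    resp.raise_for_status()
    for tw in resp.json()["statuses"]:
        coords = tw.get("coordinates")            # location metadata, if any
        session.execute(insert, (tw["id"], tw["text"],
                                 tw["created_at"], str(coords)))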
8.4 Deployment
Table 14 lists the components provided to the pilot as part of BDI16 and components that will be developed within WP6 in the context of executing the pilot.
Table 14: Components needed to deploy the Second SC7 Pilot
Module | Task | Responsible
Big Data Integrator (HDFS/Hadoop, Cassandra, Spark, Semagrow, Strabon, SOLR) | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, UoA, SWC
Cassandra and Strabon stores | The schema needs to be altered to support tweets by keyword | NCSR-D and UoA
Change detection module | Spark code to be developed for extending and improving the change detection algorithm | UoA
Event Detection module | Spark code to be developed to scale the event detection algorithm | NCSR-D
Twitter data connector | To be extended to access the keyword search Twitter API | NCSR-D
User interface | To be enhanced for the pilot | UoA
16 Cf. https://github.com/big-data-europe/README/wiki/Components
9 Conclusions
This report analysed the pilot requirements and specified the components of the generic Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting round. The relevant work in this task is to ensure that the components are within the scope of what is prepared in WP4, and that they interoperate and can be used in the same application.
All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure and provided to the piloting partners as a basis for their piloting applications, which will be developed in WP6. As a result of this preliminary testing and the interaction between the technical partners and the piloting partners, some of the original pilot descriptions have been refined and fully specified, and their usage of BDI components has been clarified. This ensures that the pilot descriptions are consistent with the first public release of the BDI platform (D4.2) and can be reproduced by interested third parties.
Work in this task (Task 5.2) will proceed as follows:
- During the second pilot deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.
- During the third pilot deployment phase, work in this task will prepare the next version of this document regarding the BDI instances needed for the third piloting round.
D54 ndash v 100
Page
17
Figure 2 Architecture of the Second SC2 Pilot
Figure 2 Architecture of the Second SC2 Pilot
33 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS for storing publication full-text and ingested datasets
A graph database for storing publication metadata (terms and named entities)
affiliation metadata (connections between researchers) weather metadata and VITIS
metadata
Processing infrastructures
Metadata extraction Spark or UnifiedViews3 are used to extract RDF metadata from
publication full-text These tools will react on Kafka messages Spark and UnifiedViews
will be evaluated for this task
3 Cf httpwwwunifiedviewseu
D54 ndash v 100
Page
18
PoolParty A SKOS Thesaurus4 will be used to consolidatetranslate (linkmap) the
terms in the ingested documents (eg bio terms locations and other named entities)
For this step the SWC PoolParty Semantic Suite5 will be used Additional enrichment
of the dataset will be explored eg via linking to DBpedia or other LOD sources
AKSTEM the process of discovering relations and associations between organizations
and people in the field of viticulture research
Phenolic Modeling algorithm already developed in AK VITIS will be adapted to work in
the context of an Apache Spark application
Variety Identification already developed in AK VITIS will be adapted to work in the
context of an Apache Spark application
Extraction of images and figures and their captions from publication PDFs
Data analysis which writes analysis results back into the infrastructure to be retrieved
for visualization Data analysis should accompany each write-back with appropriate
metadata that specify the processing lineage of the derived dataset Intermediate
results should also be written out (and described as such in the metadata) in order to
allow resuming processing after a failure
Other modules
Flume for publication ingestion For every source that will be ingested into the system
there will be a flume agent responsible for data ingestion and basic
modificationunification
Kafka as soon as a new record is available a Kafka message will be produced One
kafka consumer stores raw data into HDFS
34 Deployment
Table 4 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 4 Components needed to deploy the Second SC2 Pilot
Module Task Responsible
Spark over HDFS Flume
Kafka
BDI dockers made available by WP4 FH TF InfAI
SWC
4 Cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 5 Cf httpwwwpoolpartybiz
D54 ndash v 100
Page
19
GraphDB andor Neo4j
dockerization
To be investigated if the Docker
images provided by the official
systems6 are suitable for the pilot If
not will be altered for the pilot or use
an already dockerized triple store such
as Virtuoso or 4store
SWC
Flume agents for publication
ingestion and processing
To be developed for the pilot SWC
Flume agents for data
ingestion
To be extended for the pilot in order to
support the introduced datasets
(accuweather data user-generated
data)
SWC AK
Data storage schema To be developed for the pilot SWC AK
Phenolic modelling To be adapted from AK VITIS for the
pilot
AK
Spark AKSTEM To be adapted from AK STEM for the
pilot
AK
Variety Identification To be adapted from AK VITIS for the
pilot
AK
Table 4 Components needed to deploy the Second SC2 Pilot
6 httpsneo4jcomdeveloperdocker
D54 ndash v 100
Page
20
4 Second SC3 Pilot Deployment
41 Overview
The pilot is carried out by CRES in the frame of SC3 Secure Clean and Efficient Energy
The second pilot cycle extends the first pilot by adding additional online and offline data
analysis on raw data regarding Acoustic Emissions (AE) sensors and aggregated data such
as parametrics from continuous monitoring systems (CMS) The pilot demonstrates the
following workflow a developer in the field of wind energy enhances condition monitoring for
each unit in a wind farm by pooling together data from multiple units from the same farm (to
consider the cluster operation in total) and third party data (to perform correlated assessment)
The custom analysis modules created by the developer use both raw data that are transferred
offline to the processing cluster and condensed data streamed online at the same time order
that the event occurs
The following datasets are involved
Raw sensor and SCADA data from a given wind farm
Online stream data comprised of parametrics and statistics extracted from the raw
SCADA data
Raw sensor data from Acoustic Emissions module from a given wind farm
All data is in custom binary or ASCII formats ASCII files contain a metadata header and in
tabulated form the signal data (signal in columns time sequence in rows) All data is annotated
by location time and system id
The following processing is carried out
Near-real time execution of parametrized models to return operational statistics
warnings including correlation analysis of data across units
Weekly execution of operational statistics
Weekly execution of model parametrization
Weekly specific acoustic emissions DSP
The following outputs are made available for visualization or further processing
Operational statistics near-real time and weekly
Model parameters
D54 ndash v 100
Page
21
42 Requirements
Table 5 lists the ingestion storage processing and output requirements set by this pilot Since
the second cycle of the pilot extends the first pilot some requirements are identical and
therefore omitted from Table 5
Table 5 Requirements of Second SC3 Pilot
Requirement Comment
R1 The online data will be sent (via
OPC) from the intermediate
(local) processing level to BDI
A data connector must be developed that provides
for receiving OPC streams from an OPC-
compatible server
R2 The application should be able
to recover from short outages by
collecting the data transmitted
during the outage from the data
sources
An OPC data connector must be developed that
can retrieve the missing data collected at the
intermediate level from the distributed data
historian systems
R3 Near-realtime execution of
parametrized models to return
operational statistics including
correlation analysis of data
across units
The analysis software should write its results back
into a specified format and data model that is
appropriate input for further analysis
R4 The GUI supports database
querying and data visualization
for the analytics results
The GUI will be able to access files in the format
and data model
Table 5 Requirements of the Second SC3 Pilot
D54 ndash v 100
Page
22
Figure 3 Architecture of the Second SC3 Pilot
Figure 3 Architecture of the Second SC3 Pilot
43 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS that stores binary blobs each holding a temporal slice of the complete data The
slicing parameters are fixed and can be applied at data ingestion time
A Postgres relational database to store the warnings operational statistics and the
output of the analysis The schema will be defined in a later
A Kafka broker that will distribute the continuous stream of CMS to model execution
Processing infrastructures
D54 ndash v 100
Page
23
A processor that operates upon temporal slices of data
A Spark module that orchestrates the application of the processor on slices
A Spark streaming module that operates on the online data
Other modules
A data connector that offers an ingestion endpoint andor can retrieve from remote data
sources using the FTP protocol
A data connector that offers an ingestion endpoint that can retrieve an online stream
using OPC protocol and publish it to a Kafka topic
Data visualization that can visualize the data files stored in HDFS
44 Deployment
Table 6 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 6 Components needed to deploy the Second SC3 Pilot
Module Task Responsible
Spark HDFS Postgres
Kafka
BDI dockers made available by WP4 FH TF InfAI
NCSR-D SWC
Acoustic Emissions DSP To be developed for the pilot CRES
OPC Data connector To be developed for the pilot CRES
Data visualization To be extended for the pilot CRES
Table 6 Components needed to deploy the Second SC3 Pilot
D54 ndash v 100
Page
24
5 Second SC4 Pilot Deployment
51 Use cases
The pilot is carried out by FhG and CERTH in the frame of SC4 Smart Green and Integrated
Transport
The pilot demonstrates how to implement the workflow for ingesting processing and storing
stream and historical traffic data in a distributed environment The pilot demonstrates the
following workflows
The map matching of the Floating Car Data (FCD) stream that is generated by the taxi
fleet The FCD data that represents the position of cabs using latitude and longitude
coordinates must be map matched to the roads on which the cabs are driving in order
to infer the traffic conditions of the roads The map matching is done through an
algorithm using a geographical database and topological rules
The monitoring of the current traffic conditions that consumes the mapped FCD data
and infers the traffic conditions of the roads
The forecasting of future traffic conditions based on a model that is trained from
historical and real-time mapped FCD data
The second pilot is based upon the processing modules developed in the first pilot (cf D52
Section 5) namely the processing modules developed by CERTH to analyze traffic data
classify traffic conditions The second pilot will also develop the newly added workflow of the
traffic forecasting and model training that did not exist during the first pilot cycle
The data sources available for the pilot are
A near-real time stream Floating Car Data (FCD) generated by a fleet of 1200 taxis
containing information about the position speed and direction of the cabs
A historical database of recorded FCD data
A geographical database with information about the road network in Thessaloniki
The results of traffic monitoring and traffic forecasting are saved into a database for querying
statistics and visualizations
52 Requirements
Table 7 lists the ingestion storage processing and output requirements set by this pilot Since
the present pilot cycle is an extension of the first pilot the requirements of the first pilot also
apply Table 13 lists only the new requirements
D54 ndash v 100
Page
25
Table 7 Requirements of the Second SC4 Pilot
Requirement Comment
R1 The pilot will enable the
evaluation of the present and
future traffic conditions (eg
congestion) within temporal
windows
The FCD map matched data are used to determine
the current traffic condition and to make predictions
within different time windows
R2 The traffic predictions will be
saved in a database
Traffic condition and prediction will be used for
queries statistics evaluation of the quality of
predictions visualizations
R3 The pilot can be started in two
configurations single node (for
development and testing) and
cluster (production)
It must be possible to run all the pilot components
in one single node for development and testing
purposes The cluster configuration must provide
cluster of any components messaging system
(Kafka) processing modules (Flink Spark
TensorFlow) storage (Postgres)
Table 7 Requirements of the Second SC4 Pilot
D54 ndash v 100
Page
26
Figure 4 Architecture of the Second SC4 Pilot
Figure 4 Architecture of the Second SC4 Pilot
53 Architecture
The architecture of the pilot has been designed taking into consideration the data sources
mostly streams the processing steps needed and the information that needs to be computed
The pilot will ingest data from a near real-time FCD data stream from cabs and from historical
FCD data The FCD data needs to be preprocessed for map matching before being used for
classificationprediction
Apache Kafka will be used to distribute the computations as it provides a scalable fault
tolerant messaging system The processing of the data streams will be performed within
temporal windows Apache Flink will be used for the map matching algorithm in the same
manner as in the first cycle of the pilot Apache Spark or Tensorflow will be considered as a
platform to implement the traffic forecasting algorithm
The algorithms used for the map matching and classification will be provided using R as
it provides a good support for machine learning algorithms and because it is commonly used
D54 ndash v 100
Page
27
and well known by researchers at CERTH In order to use the R packages in a Flink application
developed in Java the pilot will connect to R server (via Rserve) Recurrent Neural Networks
will be used for the traffic forecasting module
The traffic conditions and prediction computation will be stored in a scalable fault tolerant
database such as Elasticsearch The storage system must support spatial and temporal
indexing
54 Deployment
Table 8 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 8 Components needed to deploy Second SC4 Pilot
Module Task Responsible
PostGIS Elasticsearch
Kafka Flink Spark
TensorFlow
BDI dockers made available by WP4 NCSR-D SWC
TF FhG
A Kafka producer for FCD
data stream (source URL)
and historical data (source
file system)
Develop a Kafka producer to collect
the FCD data as a stream from web
services and from the file system for
the historical data sets and send them
to a Kafka topic
FhG
Kafka brokers Install Kafka to provide a message
broker and the topics
SWC
A Spark application for traffic
forecasting and model
training
Develop a Spark application that
consumes FCD matched data from a
Kafka topic The application will train a
prediction model and write the traffic
predictions to ElasticSearch
FhG
D54 ndash v 100
Page
28
A Kafka consumer for storing
analysis results
Develop a Kafka consumer that stores
the result of the Traffic Classification
and prediction module
FhG
Table 8 Components needed to deploy the Second SC4 Pilot
6 Second SC5 Pilot Deployment
61 Use cases
The pilot is carried out by NCSR-D in the frame of SC5 Climate Action Environment Resource
Efficiency and Raw Materials
The pilot demonstrates the following workflow. A (potentially hazardous) substance is released in the atmosphere, which results in increased readings in one or more monitoring stations. The user accesses a user interface provided by the pilot to define the locations of the monitoring stations as well as a timeseries of the measured values (e.g. gamma dose rate). The platform initiates:
- a weather matching algorithm, that is, a search for similarity between the current weather and the pre-computed weather patterns, as well as
- a dispersion matching algorithm, that is, a search for similarity of the current substance dispersion patterns with the precomputed ones.
The weather patterns have been extracted in a pre-processing step by clustering weather conditions recorded in the past, while the substance dispersion patterns have been precomputed by simulating different scenarios of substance release and weather conditions. The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request.
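Although the matching algorithms will be developed within the pilot, the underlying idea, retrieving the precomputed pattern most similar to the current situation, can be sketched as a nearest-neighbour search. The sketch below assumes the patterns have already been reduced to fixed-length numeric vectors; the features and values are hypothetical.

import numpy as np

def best_match(current, patterns):
    """Return the index of the precomputed pattern (one row per pattern)
    closest to the current observation vector, by Euclidean distance."""
    distances = np.linalg.norm(patterns - current, axis=1)
    return int(np.argmin(distances))

# Hypothetical example: 3 precomputed weather patterns, 4 features each.
patterns = np.array([[5.0, 1013.0, 0.8, 270.0],
                     [22.0, 1005.0, 0.4, 90.0],
                     [12.0, 1020.0, 0.6, 180.0]])
current = np.array([11.0, 1018.0, 0.65, 175.0])
print(best_match(current, patterns))   # -> 2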
The following datasets are involved:
- NetCDF files from the European Centre for Medium-range Weather Forecasting (ECMWF7)
- GRIB files from the National Oceanic and Atmospheric Administration (NOAA8)
7 http://apps.ecmwf.int/datasets/
8 https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/global-forcast-system-gfs
The following processing will be carried out:
- The weather clustering algorithm that creates clusters of similar weather conditions, implemented using the BDI platform (see Section 6.3)
- The WRF downscaling that takes as input a low-resolution weather field and creates a high-resolution one
- The DIPCOT (DIsPersion Over COmplex Terrain) atmospheric dispersion model that computes dispersion patterns given predominant weather conditions
The following outputs are made available for visualization or further processing:
- The dispersions produced by DIPCOT
- The weather clusters produced by the weather clustering algorithm
6.2 Requirements
Table 9 lists the ingestion, storage, processing and output requirements set by this pilot.

Table 9: Requirements of the Second SC5 Pilot

Requirement | Comment
R1: Provide a means of downloading current/evaluation weather from ECMWF or alternative services | Data connector/interface needs to be developed
R2: ECMWF and NOAA datasets are compatible with the WRF and DIPCOT naming conventions | A preprocessing WPS normalization step will perform the necessary transformations and variable renamings needed to ensure compatibility
R3: Retrieve NetCDF files from HDFS as input to the weather clustering algorithm |
R4: Dispersion matching will filter on dispersion values | A relational database will provide indexes on dispersion values for efficient dispersion search
R5: Dispersion visualization | Weather and dispersion matching must produce output compatible with Sextant's input, or Sextant must be modified to support new input
Figure 5: Architecture of the Second SC5 Pilot
6.3 Architecture
To satisfy the requirements described above, the following components will be deployed:
Storage infrastructure:
- HDFS for storing NetCDF and GRIB files
- Postgres for storing dispersions
Processing components:
- Scikit-learn or TensorFlow to host the weather clustering algorithm (a minimal sketch follows this list)
Other modules:
- ECMWF and NOAA data connectors
- WPS normalization procedure
- WRF downscaling component
- DIPCOT atmospheric dispersion model
- Weather and dispersion matching
- Sextant for visualizing the dispersion layer
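As a minimal sketch of how the weather clustering component could be hosted on scikit-learn (one of the two options named above), the following clusters weather condition vectors with k-means. The feature layout, the random data standing in for historical conditions, and the number of clusters are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical feature vectors extracted from historical NetCDF/GRIB data,
# e.g. (temperature, pressure, humidity, wind direction) per time slice.
weather = np.random.rand(1000, 4)

# Cluster past weather conditions into a fixed number of patterns.
kmeans = KMeans(n_clusters=8, random_state=0).fit(weather)

# The cluster centres serve as the precomputed weather patterns,
# and new conditions can be matched with predict().
patterns = kmeans.cluster_centers_
print(kmeans.predict(weather[:1]))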
6.4 Deployment
Table 10 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.

Table 10: Components needed to deploy the Second SC5 Pilot

Module | Task | Responsible
HDFS, Sextant, Postgres | BDI dockers made available by WP4 | TF, UoA, NCSR-D
Scikit-learn, TensorFlow | To be developed in the pilot | NCSR-D
DIPCOT | To be packaged in the pilot | NCSR-D
Weather clustering algorithm | To be developed in the pilot | NCSR-D
Weather matching | To be developed in the pilot | NCSR-D
Dispersion matching | To be developed in the pilot | NCSR-D
ECMWF and NOAA data connector | To be developed in the pilot | NCSR-D
Data visualization UI | To be developed in the pilot | NCSR-D
7 Second SC6 Pilot Deployment
7.1 Use cases
The pilot is carried out by NCSR-D and SWC in the frame of SC6: Europe in a changing world - inclusive, innovative and reflective societies.
The pilot demonstrates the following workflow. Municipality economic data (i.e. budget and budget execution data) are ingested on a regular basis (daily, weekly, and so on) from a series of locations, in a variety of structures and formats, and are homogenized so that they can be compared, analyzed and visualized in a comprehensible way. The data is exposed to users via a dashboard that provides search/discovery, aggregation, analysis, correlation and visualization functionalities over structured data. The results of the data analysis will be stored in the infrastructure to avoid carrying out the same processing multiple times.
The second cycle of the pilot will extend the first pilot by incorporating different formats, by developing a modular parsing library.
The following datasets are involved:
- Budget execution data of the Municipality of Athens
- Budget execution data of the Municipality of Thessaloniki
- Budget execution data of the Municipality of Barcelona
The current datasets involved are exposed either as an API or as CSV/XML files. Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies. Statistical data will be described in the RDF Data Cube12 vocabulary.
The following processing is carried out:
- Data ingestion and homogenization
- Aggregation, analysis and correlation over the statistical data
The following outputs are made available for visualization or further processing:
- Structured information extracted from budget datasets, exposed as a SPARQL endpoint
- Metadata for dataset searching and discovery
- Aggregation and analysis
9 Cf. https://joinup.ec.europa.eu/asset/dcat_application_profile/description
10 Cf. http://www.omg.org/spec/EDMC-FIBO/FND/1.0/Beta1/index.htm
11 Cf. http://www.omg.org/hot-topics/finance.htm
12 Cf. https://www.w3.org/TR/2014/REC-vocab-data-cube-20140116/
7.2 Requirements
Table 11 lists the ingestion, storage, processing and output requirements set by this pilot.

Table 11: Requirements of the Second SC6 Pilot

Requirement | Comment
R1: In case of failures during data processing, users should be able to recover data produced earlier in the workflow, along with their lineage and associated metadata | Processing modules should periodically store intermediate results. When starting up, processing modules should check at the metadata registry if intermediate results are available
R2: Transform budget data into a homogenized format using various parsers | Parsers will be developed for the pilot, taking into account R1
R3: Expose data and metadata through a SPARQL endpoint | The triple store should be accessed via a SPARQL endpoint
R4: Intuitive, easy-to-use interface for searching and selecting relevant data sources. The use of the user interface should be documented so that users can ease into using it with as little effort as possible | The GraphSearch UI will be used to create visualizations from SPARQL queries
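A minimal sketch of the recovery behaviour required by R1 follows, assuming a hypothetical HTTP metadata registry that records the lineage of stored intermediate results. The endpoint URL and JSON fields are illustrative placeholders, not a specified API of the pilot.

import requests

REGISTRY = 'http://registry.example.org/datasets'   # hypothetical endpoint

def resume_point(workflow_id):
    """Ask the metadata registry for the latest intermediate result of a
    workflow; return its location and lineage, or None to start afresh."""
    resp = requests.get(REGISTRY, params={'workflow': workflow_id,
                                          'kind': 'intermediate'})
    results = resp.json()
    if not results:
        return None
    latest = max(results, key=lambda r: r['created'])
    return latest['hdfs_path'], latest['lineage']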
Figure 6: Architecture of the Second SC6 Pilot
7.3 Architecture
To satisfy the requirements above, the following modules will be deployed:
Storage infrastructures:
- HDFS for storing ingested datasets
- 4store for storing homogenized statistical data and dataset metadata
Processing infrastructures:
- Metadata extraction: Spark is used to extract RDF data and metadata from budget data. These tools will react on Kafka messages.
- PoolParty: A SKOS Thesaurus13 will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g. bio terms, locations and other named entities). For this step the SWC PoolParty Semantic Suite14 will be used as an external service. PoolParty is accessible from the BDE components via an HTTP API. The connection between Spark and PoolParty has been implemented in the first pilot cycle. Additional enrichment of the dataset will be explored, e.g. via linking to DBpedia or other LOD sources.
- Data analysis, which will be performed on demand by pre-defined queries in the dashboard.
Other modules:
- Flume for dataset ingestion. For every source that will be ingested into the system there will be a Flume agent responsible for data ingestion and basic modification/unification.
- Kafka: as soon as a new record is available, a Kafka message will be produced. One Kafka consumer stores raw data into HDFS.
- A set of pre-defined SPARQL queries that carry out analytical aggregations, important comparisons and/or other analysis of the data (a minimal example follows this list).
- GUI that provides functionality for (a) metadata searching to discover datasets, data and publications, and (b) linked data browsing (i.e. dereferencing entity descriptions in RDF) in the form of a visual dashboard realised in d3.js15.
- GraphSearch as the user interface.
13 Please cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System
14 Please cf. http://www.poolparty.biz
15 Cf. https://d3js.org
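As an example of the pre-defined SPARQL queries mentioned above, the following sketch (Python with SPARQLWrapper) runs an aggregation over RDF Data Cube observations. The endpoint URL and the ex: property URIs are hypothetical placeholders for the pilot's actual data model; only the qb: prefix is the standard Data Cube vocabulary.

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper('http://localhost:8890/sparql')  # illustrative endpoint
sparql.setQuery("""
    PREFIX qb: <http://purl.org/linked-data/cube#>
    PREFIX ex: <http://example.org/budget#>
    SELECT ?municipality (SUM(?amount) AS ?total)
    WHERE {
        ?obs a qb:Observation ;
             ex:municipality ?municipality ;
             ex:executedAmount ?amount .
    }
    GROUP BY ?municipality
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()['results']['bindings']:
    print(row['municipality']['value'], row['total']['value'])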
7.4 Deployment
Table 12 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.

Table 12: Components needed to deploy the Second SC6 Pilot

Module | Task | Responsible
Spark over HDFS, 4store, Flume, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, SWC
Data storage schema | To be extended for the pilot | SWC
Metadata extraction | Parsers for different data sources will be developed for the pilot | SWC
GraphSearch GUI | To be configured for the pilot | SWC
8 Second SC7 Pilot Deployment
8.1 Use cases
The pilot is carried out by SatCen, UoA and NCSR-D in the frame of SC7: Secure societies - Protecting freedom and security of Europe and its citizens.
The pilot demonstrates the following workflows:
1. Event detection workflow: News sites and social media are monitored and processed in order to extract and localize information about events. Events are categorized and the information from them is extracted; the end-user is notified about the area concerned by the news and can visualize the event information together with the changes detected by the other workflow (if activated).
2. Change detection workflow: The end user selects a relevant Area of Interest. With respect to the selected dates, two satellite images (earliest and latest) of these areas are downloaded from the ESA Sentinels Scientific Data Hub and processed in order to detect changes. The end-user is notified about detected changes and can view the images and event information about this area.
The second cycle of the SC7 pilot will extend the functionality and improve the performance of the first cycle of the pilot (cf. D5.2, Section 8).
Apart from the datasets used in the first cycle of the pilot, this cycle will also use the keyword-based Twitter API to retrieve tweets based on pre-defined keywords. To further support the keyword-based search, the second cycle of the pilot will also include a full-text indexing engine.
The following outputs are made available for visualization or further processing:
- Relevant news related to specific keywords, together with the corresponding Area of Interest
- Detected changes
Moreover, the event detection workflow will be extended in order to automatically activate the change detection workflow. These changes are depicted in the updated architecture diagram in Figure 7.
8.2 Requirements
Table 13 lists the ingestion, storage, processing and output requirements set by the second cycle of the pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 13 lists only the new requirements.

Table 13: Requirements of the Second SC7 Pilot

Requirement | Comment
R1: Monitor keyword-based text services (Twitter). Text is retrieved and stored together with provenance and any metadata provided by the service (notably location) | The NOMAD data connectors to Twitter and Reuters will be adapted to access the keyword search API of Twitter and store to Cassandra (a connector sketch is given after Table 14)
R2: Regularly execute event detection using Spark over the most recent text batch | Event detection is part of the ingestion process and adds annotations to the text data, not part of the distributed processing
R3: Improve the speed of the change detection workflow | Optimize the scalability of the operators developed in Apache Spark for the change detection workflow
R4: Extend the change detection workflow to improve accuracy | Fundamental SNAP operators (e.g. Subset and Terrain Correction) for Sentinel 1 will be adapted to Apache Spark
R5: Areas of Interest are automatically defined by event detection | The Sentinel data connector is parametrized from the event detection module with a GIS shape
R6: End-user interface is based on Sextant | Improvement of Sextant functionalities to improve the user experience
R7: Users must be authenticated and authorized to access the pilot data | Sextant will be extended in order to support authentication and authorization
Figure 7: Architecture of the Second SC7 Pilot
8.3 Architecture
To satisfy the requirements above, the following modules will be deployed:
Storage infrastructures:
- HDFS for storing satellite images
- Cassandra for storing news and tweet content and metadata (an illustrative schema is sketched after this list)
- Lucene for storing the GADM dataset, i.e. the administrative areas together with their geo-locations
- Strabon for storing geo-locations of detected changes and location metadata about news and tweets
Processing infrastructures:
- Spark will be made available for improving the change detection module and for developing the event detection module
Data integration:
- Semagrow will federate Strabon and Cassandra to provide the user interface with homogeneous access to both data stores
Other modules:
- Twitter data connector
- Reuters RSS feed reader
- The Sentinel Data Aggregator, which receives as input the set of areas of interest and submits a suitable query to the Sentinels Scientific Data Hub
- Sextant as the user interface
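The Cassandra store for tweets retrieved by keyword could be organized as sketched below (Python with the DataStax cassandra-driver). The keyspace, table and columns are illustrative assumptions for the schema change noted in Table 14, not the pilot's final schema.

from datetime import datetime
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect()

# Illustrative keyspace and table: tweets partitioned by search keyword,
# ordered by retrieval time, keeping provenance and location metadata.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS sc7 WITH replication =
        {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS sc7.tweets_by_keyword (
        keyword text, created_at timestamp, tweet_id bigint,
        body text, source text, location text,
        PRIMARY KEY ((keyword), created_at, tweet_id))
""")
session.execute(
    "INSERT INTO sc7.tweets_by_keyword "
    "(keyword, created_at, tweet_id, body, source, location) "
    "VALUES (%s, %s, %s, %s, %s, %s)",
    ('flood', datetime(2017, 2, 1, 10, 0), 123456789,
     'Severe flooding reported...', 'twitter', 'Athens, GR'))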
8.4 Deployment
Table 14 lists the components provided to the pilot as part of BDI16 and components that will be developed within WP6 in the context of executing the pilot.

Table 14: Components needed to deploy the Second SC7 Pilot

Module | Task | Responsible
Big Data Integrator: HDFS/Hadoop, Cassandra, Spark, Semagrow, Strabon, SOLR | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, UoA, SWC
Cassandra and Strabon stores | The schema needs to be altered to support tweets by keyword | NCSR-D and UoA
Change detection module | Spark code to be developed for extending and improving the change detection algorithm | UoA
Event Detection module | Spark code to be developed to scale the event detection algorithm | NCSR-D
Twitter data connector | To be extended to access the keyword search Twitter API | NCSR-D
User interface | To be enhanced for the pilot | UoA
16 Cf. https://github.com/big-data-europe/README/wiki/Components
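The sketch announced in Table 13 follows: a keyword-search extension of the Twitter data connector, assuming the standard Twitter REST search API with OAuth1 credentials. The application keys are placeholders, and the NOMAD connector being adapted for the pilot may differ in its details.

import requests
from requests_oauthlib import OAuth1

SEARCH_URL = 'https://api.twitter.com/1.1/search/tweets.json'
auth = OAuth1('API_KEY', 'API_SECRET', 'TOKEN', 'TOKEN_SECRET')  # placeholders

def search_tweets(keyword, count=100):
    """Retrieve recent tweets matching a pre-defined keyword, keeping the
    text together with provenance and location metadata (requirement R1)."""
    resp = requests.get(SEARCH_URL, auth=auth,
                        params={'q': keyword, 'count': count})
    resp.raise_for_status()
    for status in resp.json()['statuses']:
        yield {'id': status['id'], 'text': status['text'],
               'created_at': status['created_at'],
               'geo': status.get('geo'), 'source': 'twitter'}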
9 Conclusions
This report analysed the pilot requirements and specified the components of the generic Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting round. The relevant work in this task is to ensure that the components are within the scope of what is prepared in WP4, and that they interoperate and can be used in the same application.
All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure and provided to the piloting partners as a basis for their piloting applications, which will be developed in WP6. As a result of this preliminary testing and of the interaction between the technical partners and the piloting partners, some of the original pilot descriptions have been refined and fully specified, and their usage of BDI components has been clarified. This ensures that the pilot descriptions are consistent with the first public release of the BDI platform (D4.2) and can be reproduced by interested third parties.
Work in this task (Task 5.2) will proceed as follows:
- During the second pilot deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.
- During the third pilot deployment phase, work in this task will prepare the next version of this document regarding the BDI instances needed for the third piloting round.
D54 ndash v 100
Page
18
PoolParty A SKOS Thesaurus4 will be used to consolidatetranslate (linkmap) the
terms in the ingested documents (eg bio terms locations and other named entities)
For this step the SWC PoolParty Semantic Suite5 will be used Additional enrichment
of the dataset will be explored eg via linking to DBpedia or other LOD sources
AKSTEM the process of discovering relations and associations between organizations
and people in the field of viticulture research
Phenolic Modeling algorithm already developed in AK VITIS will be adapted to work in
the context of an Apache Spark application
Variety Identification already developed in AK VITIS will be adapted to work in the
context of an Apache Spark application
Extraction of images and figures and their captions from publication PDFs
Data analysis which writes analysis results back into the infrastructure to be retrieved
for visualization Data analysis should accompany each write-back with appropriate
metadata that specify the processing lineage of the derived dataset Intermediate
results should also be written out (and described as such in the metadata) in order to
allow resuming processing after a failure
Other modules
Flume for publication ingestion For every source that will be ingested into the system
there will be a flume agent responsible for data ingestion and basic
modificationunification
Kafka as soon as a new record is available a Kafka message will be produced One
kafka consumer stores raw data into HDFS
34 Deployment
Table 4 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 4 Components needed to deploy the Second SC2 Pilot
Module Task Responsible
Spark over HDFS Flume
Kafka
BDI dockers made available by WP4 FH TF InfAI
SWC
4 Cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 5 Cf httpwwwpoolpartybiz
D54 ndash v 100
Page
19
GraphDB andor Neo4j
dockerization
To be investigated if the Docker
images provided by the official
systems6 are suitable for the pilot If
not will be altered for the pilot or use
an already dockerized triple store such
as Virtuoso or 4store
SWC
Flume agents for publication
ingestion and processing
To be developed for the pilot SWC
Flume agents for data
ingestion
To be extended for the pilot in order to
support the introduced datasets
(accuweather data user-generated
data)
SWC AK
Data storage schema To be developed for the pilot SWC AK
Phenolic modelling To be adapted from AK VITIS for the
pilot
AK
Spark AKSTEM To be adapted from AK STEM for the
pilot
AK
Variety Identification To be adapted from AK VITIS for the
pilot
AK
Table 4 Components needed to deploy the Second SC2 Pilot
6 httpsneo4jcomdeveloperdocker
D54 ndash v 100
Page
20
4 Second SC3 Pilot Deployment
41 Overview
The pilot is carried out by CRES in the frame of SC3 Secure Clean and Efficient Energy
The second pilot cycle extends the first pilot by adding additional online and offline data
analysis on raw data regarding Acoustic Emissions (AE) sensors and aggregated data such
as parametrics from continuous monitoring systems (CMS) The pilot demonstrates the
following workflow a developer in the field of wind energy enhances condition monitoring for
each unit in a wind farm by pooling together data from multiple units from the same farm (to
consider the cluster operation in total) and third party data (to perform correlated assessment)
The custom analysis modules created by the developer use both raw data that are transferred
offline to the processing cluster and condensed data streamed online at the same time order
that the event occurs
The following datasets are involved
Raw sensor and SCADA data from a given wind farm
Online stream data comprised of parametrics and statistics extracted from the raw
SCADA data
Raw sensor data from Acoustic Emissions module from a given wind farm
All data is in custom binary or ASCII formats ASCII files contain a metadata header and in
tabulated form the signal data (signal in columns time sequence in rows) All data is annotated
by location time and system id
The following processing is carried out
Near-real time execution of parametrized models to return operational statistics
warnings including correlation analysis of data across units
Weekly execution of operational statistics
Weekly execution of model parametrization
Weekly specific acoustic emissions DSP
The following outputs are made available for visualization or further processing
Operational statistics near-real time and weekly
Model parameters
D54 ndash v 100
Page
21
42 Requirements
Table 5 lists the ingestion storage processing and output requirements set by this pilot Since
the second cycle of the pilot extends the first pilot some requirements are identical and
therefore omitted from Table 5
Table 5 Requirements of Second SC3 Pilot
Requirement Comment
R1 The online data will be sent (via
OPC) from the intermediate
(local) processing level to BDI
A data connector must be developed that provides
for receiving OPC streams from an OPC-
compatible server
R2 The application should be able
to recover from short outages by
collecting the data transmitted
during the outage from the data
sources
An OPC data connector must be developed that
can retrieve the missing data collected at the
intermediate level from the distributed data
historian systems
R3 Near-realtime execution of
parametrized models to return
operational statistics including
correlation analysis of data
across units
The analysis software should write its results back
into a specified format and data model that is
appropriate input for further analysis
R4 The GUI supports database
querying and data visualization
for the analytics results
The GUI will be able to access files in the format
and data model
Table 5 Requirements of the Second SC3 Pilot
D54 ndash v 100
Page
22
Figure 3 Architecture of the Second SC3 Pilot
Figure 3 Architecture of the Second SC3 Pilot
43 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS that stores binary blobs each holding a temporal slice of the complete data The
slicing parameters are fixed and can be applied at data ingestion time
A Postgres relational database to store the warnings operational statistics and the
output of the analysis The schema will be defined in a later
A Kafka broker that will distribute the continuous stream of CMS to model execution
Processing infrastructures
D54 ndash v 100
Page
23
A processor that operates upon temporal slices of data
A Spark module that orchestrates the application of the processor on slices
A Spark streaming module that operates on the online data
Other modules
A data connector that offers an ingestion endpoint andor can retrieve from remote data
sources using the FTP protocol
A data connector that offers an ingestion endpoint that can retrieve an online stream
using OPC protocol and publish it to a Kafka topic
Data visualization that can visualize the data files stored in HDFS
44 Deployment
Table 6 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 6 Components needed to deploy the Second SC3 Pilot
Module Task Responsible
Spark HDFS Postgres
Kafka
BDI dockers made available by WP4 FH TF InfAI
NCSR-D SWC
Acoustic Emissions DSP To be developed for the pilot CRES
OPC Data connector To be developed for the pilot CRES
Data visualization To be extended for the pilot CRES
Table 6 Components needed to deploy the Second SC3 Pilot
D54 ndash v 100
Page
24
5 Second SC4 Pilot Deployment
51 Use cases
The pilot is carried out by FhG and CERTH in the frame of SC4 Smart Green and Integrated
Transport
The pilot demonstrates how to implement the workflow for ingesting processing and storing
stream and historical traffic data in a distributed environment The pilot demonstrates the
following workflows
The map matching of the Floating Car Data (FCD) stream that is generated by the taxi
fleet The FCD data that represents the position of cabs using latitude and longitude
coordinates must be map matched to the roads on which the cabs are driving in order
to infer the traffic conditions of the roads The map matching is done through an
algorithm using a geographical database and topological rules
The monitoring of the current traffic conditions that consumes the mapped FCD data
and infers the traffic conditions of the roads
The forecasting of future traffic conditions based on a model that is trained from
historical and real-time mapped FCD data
The second pilot is based upon the processing modules developed in the first pilot (cf D52
Section 5) namely the processing modules developed by CERTH to analyze traffic data
classify traffic conditions The second pilot will also develop the newly added workflow of the
traffic forecasting and model training that did not exist during the first pilot cycle
The data sources available for the pilot are
A near-real time stream Floating Car Data (FCD) generated by a fleet of 1200 taxis
containing information about the position speed and direction of the cabs
A historical database of recorded FCD data
A geographical database with information about the road network in Thessaloniki
The results of traffic monitoring and traffic forecasting are saved into a database for querying
statistics and visualizations
52 Requirements
Table 7 lists the ingestion storage processing and output requirements set by this pilot Since
the present pilot cycle is an extension of the first pilot the requirements of the first pilot also
apply Table 13 lists only the new requirements
D54 ndash v 100
Page
25
Table 7 Requirements of the Second SC4 Pilot
Requirement Comment
R1 The pilot will enable the
evaluation of the present and
future traffic conditions (eg
congestion) within temporal
windows
The FCD map matched data are used to determine
the current traffic condition and to make predictions
within different time windows
R2 The traffic predictions will be
saved in a database
Traffic condition and prediction will be used for
queries statistics evaluation of the quality of
predictions visualizations
R3 The pilot can be started in two
configurations single node (for
development and testing) and
cluster (production)
It must be possible to run all the pilot components
in one single node for development and testing
purposes The cluster configuration must provide
cluster of any components messaging system
(Kafka) processing modules (Flink Spark
TensorFlow) storage (Postgres)
Table 7 Requirements of the Second SC4 Pilot
D54 ndash v 100
Page
26
Figure 4 Architecture of the Second SC4 Pilot
Figure 4 Architecture of the Second SC4 Pilot
53 Architecture
The architecture of the pilot has been designed taking into consideration the data sources
mostly streams the processing steps needed and the information that needs to be computed
The pilot will ingest data from a near real-time FCD data stream from cabs and from historical
FCD data The FCD data needs to be preprocessed for map matching before being used for
classificationprediction
Apache Kafka will be used to distribute the computations as it provides a scalable fault
tolerant messaging system The processing of the data streams will be performed within
temporal windows Apache Flink will be used for the map matching algorithm in the same
manner as in the first cycle of the pilot Apache Spark or Tensorflow will be considered as a
platform to implement the traffic forecasting algorithm
The algorithms used for the map matching and classification will be provided using R as
it provides a good support for machine learning algorithms and because it is commonly used
D54 ndash v 100
Page
27
and well known by researchers at CERTH In order to use the R packages in a Flink application
developed in Java the pilot will connect to R server (via Rserve) Recurrent Neural Networks
will be used for the traffic forecasting module
The traffic conditions and prediction computation will be stored in a scalable fault tolerant
database such as Elasticsearch The storage system must support spatial and temporal
indexing
54 Deployment
Table 8 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 8 Components needed to deploy Second SC4 Pilot
Module Task Responsible
PostGIS Elasticsearch
Kafka Flink Spark
TensorFlow
BDI dockers made available by WP4 NCSR-D SWC
TF FhG
A Kafka producer for FCD
data stream (source URL)
and historical data (source
file system)
Develop a Kafka producer to collect
the FCD data as a stream from web
services and from the file system for
the historical data sets and send them
to a Kafka topic
FhG
Kafka brokers Install Kafka to provide a message
broker and the topics
SWC
A Spark application for traffic
forecasting and model
training
Develop a Spark application that
consumes FCD matched data from a
Kafka topic The application will train a
prediction model and write the traffic
predictions to ElasticSearch
FhG
D54 ndash v 100
Page
28
A Kafka consumer for storing
analysis results
Develop a Kafka consumer that stores
the result of the Traffic Classification
and prediction module
FhG
Table 8 Components needed to deploy the Second SC4 Pilot
6 Second SC5 Pilot Deployment
61 Use cases
The pilot is carried out by NCSR-D in the frame of SC5 Climate Action Environment Resource
Efficiency and Raw Materials
The pilot demonstrates the following workflow A (potentially hazardous) substance is released
in the atmosphere that results to increased readings in one or more monitoring stations The
user accesses a user interface provided by the pilot to define the locations of the monitoring
stations as well as a timeseries of the measured values (eg gamma dose rate) The platform
initiates
a weather matching algorithm that is a search for similarity of the current weather and
the pre-computed weather patterns as well as
a dispersion matching algorithm that is a search for similarity of the current substance
dispersion patterns with the precomputed ones
The weather patterns have been extracted in a pre-processing step by clustering weather
conditions recorded in the past while the substance dispersion patterns have been
precomputed by simulating different scenarios of substance release and weather conditions
The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request
The following datasets are involved
NetCDF files from the European Centre for Medium range Weather Forecasting
(ECMWF7)
GRIB files from National Oceanic and Atmospheric Administration (NOAA8)
The following processing will be carried out
The weather clustering algorithm that creates clusters of similar weather conditions
implemented using the BDI platform (see Section 63)
7 httpappsecmwfintdatasets 8 httpswwwncdcnoaagovdata-accessmodel-datamodel-datasetsglobal-forcast-system-gfs
D54 ndash v 100
Page
29
The WRF downscaling that takes as input a low resolution weather and creates a high
resolution weather
The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model
computes dispersion patterns given predominant weather conditions
The following outputs are made available for visualization or further processing
The dispersions produced by DIPCOT
The weather clusters produced by the weather clustering algorithm
62 Requirements
Table 9 lists the ingestion storage processing and output requirements set by this pilot
Table 9 Requirements of Second SC5 Pilot
Requirement Comment
R1 Provide a means of downloading
currentevaluation weather from
ECMWF or alternative services
Data connectorinterface needs to be developed
R2 ECMWF and NOAA datasets are
compatible with the WRF and
DIPCOT naming conventions
A preprocessing WPS normalization step will
perform the necessary transformations and
variable renamings needs to ensure compatibility
R3 Retrieve NetCDF files from HDFS
as input to the weather clustering
algorithm
R4 Dispersion matching will filter on
dispersion values
Relational database will provide indexes on
dispersion values for efficient dispersion search
R5 Dispersion visualization Weather and dispersion matching must produce
output compatible with Sextantrsquos input or Sextant
must be modified to support new input
Table 9 Requirements of the Second SC5 Pilot
D54 ndash v 100
Page
30
Figure 5 Architecture of the Second SC5 Pilot
Figure 5 Architecture of the Second SC5 Pilot
63 Architecture
To satisfy the requirements described above the following components will be deployed
Storage infrastructure
HDFS for storing NetCDF and GRIB files
Postgres for storing dispersions
Processing components
Scilearn-kit or TensorFlow to host the weather clustering algorithm
Other modules
ECMWF and NOAA data connectors
WPS normalization procedure
WRF downscaling component
DIPCOT atmospheric dispersion model
Weather and dispersion matching
Sextant for visualizing the dispersion layer
D54 ndash v 100
Page
31
64 Deployment
Table 10 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 10 Components needed to deploy the Second SC5 Pilot
Module Task Responsible
HDFS Sextant Postgres BDI dockers made available by WP4 TF UoA NCSR-D
Scikit-learn TensorFlow To be developed in the pilot NCSR-D
DIPCOT To be packaged in the pilot NCSR-D
Weather clustering algorithm To be developed in the pilot NCSR-D
Weather matching To be developed in the pilot NCSR-D
Dispersion matching To be developed in the pilot NCSR-D
ECMWF and NOAA data
connector
To be developed in the pilot NCSR-D
Data visualization UI To be developed in the pilot NCSR-D
Table 10 Components needed to deploy the Second SC5 Pilot
D54 ndash v 100
Page
32
7 Second SC6 Pilot Deployment
71 Use cases
The pilot is carried out by NCSR-D and SWC in the frame of SC6 Europe in a changing world
- inclusive innovative and reflective societies
The pilot demonstrates the following workflow Municipality economic data (ie budget and
budget execution data) are ingested at a regular basis (daily weekly and so on) from a series
of locations in a variety of structures and formats are homogenized so that they can be
compared analyzed and visualized in a comprehensible way The data is exposed to users
via a dashboard that exposes searchdiscovery aggregation analysis correlation and
visualization functionalities over structured data The results of the data analysis will be stored
in the infrastructure to avoid carrying out the same processing multiple times
The second cycle of the pilot will extend the first pilot by incorporating different formats by
developing a modular parsing library
The following datasets are involved
Budget execution data of Municipality of Athens
Budget execution data of Municipality of Thessaloniki
Budget execution data of Municipality of Barcelona
The current datasets involved are exposed either as an API or as CSV XML files
Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies
Statistical data will be described in the RDF DataCube12 vocabulary
The following processing is carried out
Data ingestion and homogenization
Aggregation analysis correlation over scientific data
The following outputs are made available for visualization or further processing
Structured information extracted from budget datasets exposed as a SPARQL endpoint
Metadata for dataset searching and discovery
9 Cf httpsjoinupeceuropaeuassetdcat_application_profiledescription 10 Cf httpwwwomgorgspecEDMC-FIBOFND10Beta1indexhtm 11 Cf httpwwwomgorghot-topicsfinancehtm 12 Cf httpswwww3orgTR2014REC-vocab-data-cube-20140116
D54 ndash v 100
Page
33
Aggregation and analysis
72 Requirements
Table 11 lists the ingestion storage processing and output requirements set by this pilot
Table 11 Requirements of the Second SC6 Pilot
Requirement Comment
R1 In case of failures during data
processing users should be able to
recover data produced earlier in the
workflow along with their lineage and
associated metadata
Processing modules should periodically
store intermediate results When starting
up processing modules should check at the
metadata registry if intermediate results are
available
R2 Transform budget data into a
homogenized format using various
parsers
Parsers will be developed for the pilot
taking into account R1
R3 Expose data and metadata through a
SPARQL endpoint
The triple store should be accessed via a
SPARQL endpoint
R4 Intuitive easy-to-use interface for
searching and selecting relevant data
sources The use of the user interface
should be documented so that users
can ease into using it with as little
effort as possible
The GraphSearch UI will be used to create
visualizations from SPARQL queries
Table 11 Requirements of the Second SC6 Pilot
D54 ndash v 100
Page
34
Figure 6 Architecture of the Second SC6 Pilot
Figure 6 Architecture of the Second SC6 Pilot
73 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS for storing ingested datasets
4store for storing homogenized statistical data and dataset metadata
Processing infrastructures
Metadata extraction Spark is used to extract RDF data and metadata from budget
data These tools will react on Kafka messages
PoolParty A SKOS Thesaurus13 will be used to consolidatetranslate (linkmap) the
terms in the ingested documents (eg bio terms locations and other named entities)
For this step the SWC PoolParty Semantic Suite14 will be used as an external service
13 Please cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 14 Please cf httpwwwpoolpartybiz
D54 ndash v 100
Page
35
PoolParty is accessible from the BDE components via an HTTP API The connection
between Spark and PoolParty has been implemented in the first pilot cycle Additional
enrichment of the dataset will be explored eg via linking to DBpedia or other LOD
sources
Data analysis that will be performed on demand by pre-defined queries in the
dashboard
Other modules
Flume for dataset ingestion For every source that will be ingested into the system there
will be a flume agent responsible for data ingestion and basic modificationunification
Kafka as soon as a new record is available a Kafka message will be produced One
kafka consumer stores raw data into HDFS
A set of pre-defined SPARQL queries that carry out analytical aggregations important
comparisons and or other analysis of the data
GUI that provide functionality for (a) metadata searching to discover datasets data and
publications (b) linked data browsing (ie dereferencing entity descriptions in RDF) in
the form of a visual dashboard realised in d3js15
GraphSearch as the user interface
74 Deployment
Table 12 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 12 Components needed to deploy the Second SC6 Pilot
Module Task Responsible
Spark over HDFS 4store
Flume Kafka
BDI dockers made available by WP4 FH TF InfAI
NCSR-D SWC
Data storage schema To be extended for the pilot SWC
Metadata extraction Parsers for different data sources will
be developed for the pilot
SWC
15 Cf httpsd3jsorg
D54 ndash v 100
Page
36
GraphSearch GUI To be configured for the pilot SWC
Table 12 Components needed to deploy the Second SC6 Pilot
8 Second SC7 Pilot Deployment
81 Use cases
The pilot is carried out by SatCen UoA and NCSR-D in the frame of SC7 Secure societies ndash
Protecting freedom and security of Europe and its citizens
The pilot demonstrates the following workflows
1 Event detection workflow News sites and social media are monitored and processed
in order to extract and localize information about events Events are categorized and
the information from them is extracted the end-user is notified about the area interested
by the news and can visualize the events information together with the changes
detected by the other workflow (if activated)
2 Change detection workflow The end user selects a relevant Area of Interest With
respect to the selected dates two satellite images (earliest and latest) of these areas
are downloaded from ESA Sentinels Scientific Data Hub and processed in order to
detect changes The end-user is notified about detected changes and can view the
images and event information about this area
The second cycle of the SC7 pilot will extend the functionality and improve the performance of
the first cycle of the pilot (cf D52 Section 8)
Apart from the datasets used in the first cycle of the pilot this cycle will also use the keyword-
based Twitter API to retrieve tweets based on pre-defined keywords To further support the
keyword-based search the second cycle of the pilot will also include a full-text indexing engine
The following outputs are made available for visualization or further processing
Relevant news related to specific keywords together with the corresponding Area of
Interested
Detected changes
Moreover the event detection workflow will be extended in order to automatically activate the
change detection workflow These changes are depicted in the updated architecture diagram
in Figure 7
D54 ndash v 100
Page
37
82 Requirements
Table 13 lists the ingestion storage processing and output requirements set by the second
cycle of the pilot Since the present pilot cycle is an extension of the first pilot the requirements
of the first pilot also apply Table 13 lists only the new requirements
Table 13 Requirements of the Second SC7 Pilot
Requirement Comment
R1 Monitor keyword-based text services
(Twitter) Text is retrieved and stored
together with provenance and any
metadata provided by the service
(notably location)
The NOMAD data connectors to Twitter
and Reuters will be adapted to access the
keyword search API of Twitter and store to
Cassandra
R2 Regularly execute event detection
using Spark over the most recent text
batch
Event detection is part of the ingestion
process and adds annotations to the text
data not part of the distributed processing
R3 Improve the speed of the change
detection workflow
Optimize the scalability of the operators
developed in Apache Spark for the change
detection workflow
R4 Extend change detection workflow to
improve accuracy
Fundamental SNAP operators (eg Subset
and Terrain Correction) for Sentinel 1 will be
adapted to Apache Spark
R5 Areas of Interest are automatically
defined by event detection
The Sentinel data connector is
parametrized from the event detection
module with a GIS shape
R6 End-user interface is based on Sextant Improvement of Sextant functionalities to
improve the user experience
D54 ndash v 100
Page
38
R7 Users must be authenticated and
authorized to access the pilot data
Sextant will be extended in order to support
authentication and authorization
Table 13 Requirements of the Second SC7 Pilot
Figure 7 Architecture of the Second SC7 Pilot
Figure 7 Architecture of the Second SC7 Pilot
83 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS for storing satellite images
Cassandra for storing news and tweets content and metadata
Lucene for storing GADM dataset ie the administrative areas together with their geo-
locations
D54 ndash v 100
Page
39
Strabon for storing geo-locations of detected changes and location metadata about
news and tweets
Processing infrastructures
Spark will be made available for improving the change detection module and
developing the event detection module
Data integration
Semagrow will federate Strabon and Cassandra to provide the user interface with
homogeneous access to both data stores
Other modules
Twitter data connector
Reuters RSS feed reader
The Sentinel Data Aggregator receives as input the set of areas of interest and submits
a suitable query to the Sentinels Scientific Data Hub
Sextant as the user interface
84 Deployment
Table 14 lists the components provided to the pilot as part of BDI16 and components that will
be developed within WP6 in the context of executing the pilot
Table 14 Components needed to deploy the Second SC7 Pilot
Module Task Responsible
Big Data Integrator
HDFSHadoop Cassandra
Spark Semagrow Strabon
SOLR
BDI dockers made available by WP4 FH TF InfAI
NCSR-D UoA
SwC
Cassandra and Strabon
stores
The schema needs to be altered to
support tweets by keyword
NCSR-D and
UoA
Change detection module Spark code to be developed for UoA
16 Cf httpsgithubcombig-data-europeREADMEwikiComponents
D54 ndash v 100
Page
40
extending and improving the change
detection algorithm
Event Detection module Spark code to be developed to scale
the event detection algorithm
NCSR-D
Twitter data connector To be extended to access the keyword
search Twitter API
NCSR-D
User interface To be enhanced for the pilot UoA
Table 14 Components needed to deploy the Second SC7 Pilot
D54 ndash v 100
Page
41
9 Conclusions This report analysed the pilot requirements and specifies the components of the the generic
Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting
round The relevant work in this task is to ensure that the components are within the scope
of what is prepared in WP4 and that they interoperate and can be used in the same
application
All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure
and provided to the piloting partners as a basis for their piloting applications which will be
developed in WP6 As a result of this preliminary testing and the interaction between the
technical partners and the piloting partners some of the original pilot descriptions have
been refined and fully specified and their usage of BDI components has been clarified This
ensures that the pilot descriptions are consistent with the first public release of the BDI
platform (D42) and can be reproduced by interested third parties
Work in this task (Task 52) will proceed as follows
During the second pilot deployment phase work in this task will follow and document
development of the individual components and test their integration into the platform
During the third pilot deployment phase work in this task will prepare the next version
of this document regarding the BDI instances needed for the third piloting round
D54 ndash v 100
Page
19
GraphDB andor Neo4j
dockerization
To be investigated if the Docker
images provided by the official
systems6 are suitable for the pilot If
not will be altered for the pilot or use
an already dockerized triple store such
as Virtuoso or 4store
SWC
Flume agents for publication
ingestion and processing
To be developed for the pilot SWC
Flume agents for data
ingestion
To be extended for the pilot in order to
support the introduced datasets
(accuweather data user-generated
data)
SWC AK
Data storage schema To be developed for the pilot SWC AK
Phenolic modelling To be adapted from AK VITIS for the
pilot
AK
Spark AKSTEM To be adapted from AK STEM for the
pilot
AK
Variety Identification To be adapted from AK VITIS for the
pilot
AK
Table 4 Components needed to deploy the Second SC2 Pilot
6 httpsneo4jcomdeveloperdocker
D54 ndash v 100
Page
20
4 Second SC3 Pilot Deployment
41 Overview
The pilot is carried out by CRES in the frame of SC3 Secure Clean and Efficient Energy
The second pilot cycle extends the first pilot by adding additional online and offline data
analysis on raw data regarding Acoustic Emissions (AE) sensors and aggregated data such
as parametrics from continuous monitoring systems (CMS) The pilot demonstrates the
following workflow a developer in the field of wind energy enhances condition monitoring for
each unit in a wind farm by pooling together data from multiple units from the same farm (to
consider the cluster operation in total) and third party data (to perform correlated assessment)
The custom analysis modules created by the developer use both raw data that are transferred
offline to the processing cluster and condensed data streamed online at the same time order
that the event occurs
The following datasets are involved
Raw sensor and SCADA data from a given wind farm
Online stream data comprised of parametrics and statistics extracted from the raw
SCADA data
Raw sensor data from Acoustic Emissions module from a given wind farm
All data is in custom binary or ASCII formats ASCII files contain a metadata header and in
tabulated form the signal data (signal in columns time sequence in rows) All data is annotated
by location time and system id
The following processing is carried out
Near-real time execution of parametrized models to return operational statistics
warnings including correlation analysis of data across units
Weekly execution of operational statistics
Weekly execution of model parametrization
Weekly specific acoustic emissions DSP
The following outputs are made available for visualization or further processing
Operational statistics near-real time and weekly
Model parameters
D54 ndash v 100
Page
21
42 Requirements
Table 5 lists the ingestion storage processing and output requirements set by this pilot Since
the second cycle of the pilot extends the first pilot some requirements are identical and
therefore omitted from Table 5
Table 5 Requirements of Second SC3 Pilot
Requirement Comment
R1 The online data will be sent (via
OPC) from the intermediate
(local) processing level to BDI
A data connector must be developed that provides
for receiving OPC streams from an OPC-
compatible server
R2 The application should be able
to recover from short outages by
collecting the data transmitted
during the outage from the data
sources
An OPC data connector must be developed that
can retrieve the missing data collected at the
intermediate level from the distributed data
historian systems
R3 Near-realtime execution of
parametrized models to return
operational statistics including
correlation analysis of data
across units
The analysis software should write its results back
into a specified format and data model that is
appropriate input for further analysis
R4 The GUI supports database
querying and data visualization
for the analytics results
The GUI will be able to access files in the format
and data model
Table 5 Requirements of the Second SC3 Pilot
D54 ndash v 100
Page
22
Figure 3 Architecture of the Second SC3 Pilot
Figure 3 Architecture of the Second SC3 Pilot
43 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS that stores binary blobs each holding a temporal slice of the complete data The
slicing parameters are fixed and can be applied at data ingestion time
A Postgres relational database to store the warnings operational statistics and the
output of the analysis The schema will be defined in a later
A Kafka broker that will distribute the continuous stream of CMS to model execution
Processing infrastructures
D54 ndash v 100
Page
23
A processor that operates upon temporal slices of data
A Spark module that orchestrates the application of the processor on slices
A Spark streaming module that operates on the online data
Other modules
A data connector that offers an ingestion endpoint andor can retrieve from remote data
sources using the FTP protocol
A data connector that offers an ingestion endpoint that can retrieve an online stream
using OPC protocol and publish it to a Kafka topic
Data visualization that can visualize the data files stored in HDFS
44 Deployment
Table 6 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 6 Components needed to deploy the Second SC3 Pilot
Module Task Responsible
Spark HDFS Postgres
Kafka
BDI dockers made available by WP4 FH TF InfAI
NCSR-D SWC
Acoustic Emissions DSP To be developed for the pilot CRES
OPC Data connector To be developed for the pilot CRES
Data visualization To be extended for the pilot CRES
Table 6 Components needed to deploy the Second SC3 Pilot
D54 ndash v 100
Page
24
5 Second SC4 Pilot Deployment
51 Use cases
The pilot is carried out by FhG and CERTH in the frame of SC4 Smart Green and Integrated
Transport
The pilot demonstrates how to implement the workflow for ingesting processing and storing
stream and historical traffic data in a distributed environment The pilot demonstrates the
following workflows
The map matching of the Floating Car Data (FCD) stream that is generated by the taxi
fleet The FCD data that represents the position of cabs using latitude and longitude
coordinates must be map matched to the roads on which the cabs are driving in order
to infer the traffic conditions of the roads The map matching is done through an
algorithm using a geographical database and topological rules
The monitoring of the current traffic conditions that consumes the mapped FCD data
and infers the traffic conditions of the roads
The forecasting of future traffic conditions based on a model that is trained from
historical and real-time mapped FCD data
The second pilot is based upon the processing modules developed in the first pilot (cf D52
Section 5) namely the processing modules developed by CERTH to analyze traffic data
classify traffic conditions The second pilot will also develop the newly added workflow of the
traffic forecasting and model training that did not exist during the first pilot cycle
The data sources available for the pilot are
A near-real time stream Floating Car Data (FCD) generated by a fleet of 1200 taxis
containing information about the position speed and direction of the cabs
A historical database of recorded FCD data
A geographical database with information about the road network in Thessaloniki
The results of traffic monitoring and traffic forecasting are saved into a database for querying
statistics and visualizations
52 Requirements
Table 7 lists the ingestion storage processing and output requirements set by this pilot Since
the present pilot cycle is an extension of the first pilot the requirements of the first pilot also
apply Table 13 lists only the new requirements
D54 ndash v 100
Page
25
Table 7 Requirements of the Second SC4 Pilot
Requirement Comment
R1 The pilot will enable the
evaluation of the present and
future traffic conditions (eg
congestion) within temporal
windows
The FCD map matched data are used to determine
the current traffic condition and to make predictions
within different time windows
R2 The traffic predictions will be
saved in a database
Traffic condition and prediction will be used for
queries statistics evaluation of the quality of
predictions visualizations
R3 The pilot can be started in two
configurations single node (for
development and testing) and
cluster (production)
It must be possible to run all the pilot components
in one single node for development and testing
purposes The cluster configuration must provide
cluster of any components messaging system
(Kafka) processing modules (Flink Spark
TensorFlow) storage (Postgres)
Table 7 Requirements of the Second SC4 Pilot
D54 ndash v 100
Page
26
Figure 4 Architecture of the Second SC4 Pilot
Figure 4 Architecture of the Second SC4 Pilot
53 Architecture
The architecture of the pilot has been designed taking into consideration the data sources
mostly streams the processing steps needed and the information that needs to be computed
The pilot will ingest data from a near real-time FCD data stream from cabs and from historical
FCD data The FCD data needs to be preprocessed for map matching before being used for
classificationprediction
Apache Kafka will be used to distribute the computations as it provides a scalable fault
tolerant messaging system The processing of the data streams will be performed within
temporal windows Apache Flink will be used for the map matching algorithm in the same
manner as in the first cycle of the pilot Apache Spark or Tensorflow will be considered as a
platform to implement the traffic forecasting algorithm
The algorithms used for the map matching and classification will be provided using R as
it provides a good support for machine learning algorithms and because it is commonly used
D54 ndash v 100
Page
27
and well known by researchers at CERTH. In order to use the R packages in a Flink application developed in Java, the pilot will connect to an R server (via Rserve). Recurrent Neural Networks will be used for the traffic forecasting module.

The traffic conditions and prediction computations will be stored in a scalable, fault-tolerant database such as Elasticsearch. The storage system must support spatial and temporal indexing; a sketch of such an index definition is given below.
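For illustration only (this is not part of the pilot codebase), the following minimal sketch shows how an Elasticsearch index with spatial and temporal indexing could be defined for the traffic predictions, assuming the elasticsearch-py 8.x client; the index and field names (traffic_predictions, road_id, position, predicted_at, congestion_level) are hypothetical.

```python
from elasticsearch import Elasticsearch

# Connect to the Elasticsearch node used by the pilot (address is an assumption).
es = Elasticsearch("http://localhost:9200")

# A geo_point field supports spatial queries; a date field supports the
# temporal windows. All field names are illustrative.
es.indices.create(
    index="traffic_predictions",
    mappings={
        "properties": {
            "road_id": {"type": "keyword"},          # road segment identifier
            "position": {"type": "geo_point"},       # lat/lon of the segment
            "predicted_at": {"type": "date"},        # start of prediction window
            "congestion_level": {"type": "float"},   # model output
        }
    },
)
```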
5.4 Deployment

Table 8 lists the components provided to the pilot as part of BDI, and components that will be developed within WP6 in the context of executing the pilot.
Table 8: Components needed to deploy the Second SC4 Pilot

Module | Task | Responsible
PostGIS, Elasticsearch, Kafka, Flink, Spark, TensorFlow | BDI dockers made available by WP4 | NCSR-D, SWC, TF, FhG
A Kafka producer for the FCD data stream (source: URL) and historical data (source: file system) | Develop a Kafka producer to collect the FCD data as a stream from web services, and from the file system for the historical data sets, and send them to a Kafka topic (see the sketch after this table) | FhG
Kafka brokers | Install Kafka to provide a message broker and the topics | SWC
A Spark application for traffic forecasting and model training | Develop a Spark application that consumes map-matched FCD data from a Kafka topic; the application will train a prediction model and write the traffic predictions to Elasticsearch | FhG
A Kafka consumer for storing analysis results | Develop a Kafka consumer that stores the results of the traffic classification and prediction module | FhG
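As a rough sketch of the Kafka producer listed above (assuming the kafka-python client; the topic name fcd-stream and the FCD field names are hypothetical, not the pilot's actual schema):

```python
import json
from kafka import KafkaProducer

# Connect to the pilot's Kafka broker (address is an assumption).
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

def publish_fcd(record: dict) -> None:
    """Send one FCD record (from the web service or a historical file)
    to the hypothetical 'fcd-stream' topic."""
    producer.send("fcd-stream", value=record)

# Example record; field names are illustrative.
publish_fcd({"taxi_id": 42, "lat": 40.6401, "lon": 22.9444,
             "speed_kmh": 35.0, "heading_deg": 120,
             "ts": "2017-01-01T12:00:00Z"})
producer.flush()  # ensure buffered messages are delivered
```

A matching Kafka consumer for the analysis results would symmetrically read from the output topic and write each message to the results store.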
6 Second SC5 Pilot Deployment

6.1 Use cases

The pilot is carried out by NCSR-D in the frame of SC5: Climate Action, Environment, Resource Efficiency and Raw Materials.

The pilot demonstrates the following workflow. A (potentially hazardous) substance is released in the atmosphere, resulting in increased readings at one or more monitoring stations. The user accesses a user interface provided by the pilot to define the locations of the monitoring stations as well as a timeseries of the measured values (e.g., gamma dose rate). The platform initiates:
- a weather matching algorithm, i.e., a search for similarity between the current weather and the pre-computed weather patterns, as well as
- a dispersion matching algorithm, i.e., a search for similarity between the current substance dispersion patterns and the precomputed ones.

The weather patterns have been extracted in a pre-processing step by clustering weather conditions recorded in the past, while the substance dispersion patterns have been precomputed by simulating different scenarios of substance release and weather conditions. The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request.

The following datasets are involved:
- NetCDF files from the European Centre for Medium-Range Weather Forecasts (ECMWF) [7]
- GRIB files from the National Oceanic and Atmospheric Administration (NOAA) [8]

The following processing will be carried out:
- The weather clustering algorithm that creates clusters of similar weather conditions, implemented using the BDI platform (see Section 6.3)
- The WRF downscaling that takes low-resolution weather data as input and creates high-resolution weather data
- The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model that computes dispersion patterns given predominant weather conditions

The following outputs are made available for visualization or further processing:
- The dispersions produced by DIPCOT
- The weather clusters produced by the weather clustering algorithm

[7] http://apps.ecmwf.int/datasets
[8] https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/global-forcast-system-gfs
6.2 Requirements

Table 9 lists the ingestion, storage, processing, and output requirements set by this pilot.
Table 9: Requirements of the Second SC5 Pilot

R1: Provide a means of downloading current/evaluation weather from ECMWF or alternative services.
Comment: A data connector/interface needs to be developed.

R2: ECMWF and NOAA datasets are compatible with the WRF and DIPCOT naming conventions.
Comment: A preprocessing WPS normalization step will perform the transformations and variable renamings needed to ensure compatibility.

R3: Retrieve NetCDF files from HDFS as input to the weather clustering algorithm (see the retrieval sketch after this table).

R4: Dispersion matching will filter on dispersion values.
Comment: A relational database will provide indexes on dispersion values for efficient dispersion search.

R5: Dispersion visualization.
Comment: Weather and dispersion matching must produce output compatible with Sextant's input, or Sextant must be modified to support the new input.
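As a rough illustration of R3 (not the pilot's actual code), the following sketch reads a NetCDF file out of HDFS into memory and opens it with the netCDF4 library. It assumes the hdfs (WebHDFS) Python client and an in-memory-capable netCDF4 build; the NameNode address and file path are hypothetical.

```python
from hdfs import InsecureClient   # WebHDFS client from the 'hdfs' package
from netCDF4 import Dataset

# Address of the HDFS NameNode's WebHDFS endpoint (an assumption).
client = InsecureClient("http://namenode:9870", user="bde")

# Read the raw bytes of a (hypothetical) weather file stored in HDFS.
with client.read("/weather/ecmwf/2017-01-01.nc") as reader:
    raw = reader.read()

# netCDF4 can open an in-memory buffer; from here the variables can be
# turned into feature vectors for the weather clustering algorithm.
ds = Dataset("in-memory.nc", mode="r", memory=raw)
print(list(ds.variables))
```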
Figure 5: Architecture of the Second SC5 Pilot
6.3 Architecture

To satisfy the requirements described above, the following components will be deployed:

Storage infrastructure:
- HDFS for storing NetCDF and GRIB files
- Postgres for storing dispersions

Processing components:
- Scikit-learn or TensorFlow to host the weather clustering algorithm (see the clustering sketch after this list)

Other modules:
- ECMWF and NOAA data connectors
- WPS normalization procedure
- WRF downscaling component
- DIPCOT atmospheric dispersion model
- Weather and dispersion matching
- Sextant for visualizing the dispersion layer
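A minimal sketch of how the weather clustering could be hosted on scikit-learn, assuming each historical weather snapshot has already been flattened into a numeric feature vector; the array shapes and the choice of k-means with eight clusters are illustrative assumptions, not the pilot's actual configuration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Each row is one historical weather snapshot flattened into a feature
# vector (e.g. wind components, temperature, pressure on a grid).
# Random data stands in for features extracted from the NetCDF/GRIB files.
X = np.random.rand(1000, 64)

# Standardize features so no single variable dominates the distance metric.
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# Cluster past weather conditions into a set of representative patterns.
kmeans = KMeans(n_clusters=8, random_state=0).fit(X_scaled)

# The cluster centers are the pre-computed weather patterns; weather
# matching then amounts to finding the nearest center for current weather.
current = np.random.rand(1, 64)
pattern_id = kmeans.predict(scaler.transform(current))[0]
print("closest weather pattern:", pattern_id)
```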
6.4 Deployment

Table 10 lists the components provided to the pilot as part of BDI, and components that will be developed within WP6 in the context of executing the pilot.
Table 10: Components needed to deploy the Second SC5 Pilot

Module | Task | Responsible
HDFS, Sextant, Postgres | BDI dockers made available by WP4 | TF, UoA, NCSR-D
Scikit-learn, TensorFlow | To be developed in the pilot | NCSR-D
DIPCOT | To be packaged in the pilot | NCSR-D
Weather clustering algorithm | To be developed in the pilot | NCSR-D
Weather matching | To be developed in the pilot | NCSR-D
Dispersion matching | To be developed in the pilot | NCSR-D
ECMWF and NOAA data connector | To be developed in the pilot | NCSR-D
Data visualization UI | To be developed in the pilot | NCSR-D
7 Second SC6 Pilot Deployment

7.1 Use cases

The pilot is carried out by NCSR-D and SWC in the frame of SC6: Europe in a changing world – inclusive, innovative and reflective societies.

The pilot demonstrates the following workflow. Municipality economic data (i.e., budget and budget execution data) are ingested on a regular basis (daily, weekly, and so on) from a series of locations, in a variety of structures and formats, and are homogenized so that they can be compared, analyzed, and visualized in a comprehensible way. The data is exposed to users via a dashboard that offers search/discovery, aggregation, analysis, correlation, and visualization functionalities over structured data. The results of the data analysis will be stored in the infrastructure to avoid carrying out the same processing multiple times.

The second cycle of the pilot will extend the first pilot by incorporating different formats, for which a modular parsing library will be developed.

The following datasets are involved:
- Budget execution data of the Municipality of Athens
- Budget execution data of the Municipality of Thessaloniki
- Budget execution data of the Municipality of Barcelona

The current datasets involved are exposed either via an API or as CSV/XML files. Datasets will be described by DCAT-AP [9] metadata and the FIBO [10] and FIGI [11] ontologies. Statistical data will be described using the RDF Data Cube [12] vocabulary.

The following processing is carried out:
- Data ingestion and homogenization
- Aggregation, analysis, and correlation over the data

The following outputs are made available for visualization or further processing:
- Structured information extracted from budget datasets, exposed as a SPARQL endpoint
- Metadata for dataset searching and discovery
- Aggregation and analysis results

[9] Cf. https://joinup.ec.europa.eu/asset/dcat_application_profile/description
[10] Cf. http://www.omg.org/spec/EDMC-FIBO/FND/1.0/Beta1/index.htm
[11] Cf. http://www.omg.org/hot-topics/finance.htm
[12] Cf. https://www.w3.org/TR/2014/REC-vocab-data-cube-20140116/
7.2 Requirements

Table 11 lists the ingestion, storage, processing, and output requirements set by this pilot.
Table 11: Requirements of the Second SC6 Pilot

R1: In case of failures during data processing, users should be able to recover data produced earlier in the workflow, along with their lineage and associated metadata.
Comment: Processing modules should periodically store intermediate results. When starting up, processing modules should check at the metadata registry whether intermediate results are available.

R2: Transform budget data into a homogenized format using various parsers.
Comment: Parsers will be developed for the pilot, taking into account R1 (see the parser sketch after this table).

R3: Expose data and metadata through a SPARQL endpoint.
Comment: The triple store should be accessed via a SPARQL endpoint.

R4: Intuitive, easy-to-use interface for searching and selecting relevant data sources. The use of the user interface should be documented, so that users can ease into using it with as little effort as possible.
Comment: The GraphSearch UI will be used to create visualizations from SPARQL queries.
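To make the modular parsing idea concrete, here is a minimal, hypothetical sketch of a parser registry that dispatches budget files to format-specific parsers and returns records in one homogenized shape. The record fields, element names, and registry design are assumptions for illustration, not the pilot's actual data model.

```python
import csv
import io
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Iterable

# Every parser yields dicts with the same keys, regardless of source format.
Record = dict

PARSERS: Dict[str, Callable[[str], Iterable[Record]]] = {}

def parser(fmt: str):
    """Register a parser for a given source format (the modular part)."""
    def register(fn):
        PARSERS[fmt] = fn
        return fn
    return register

@parser("csv")
def parse_csv(text: str) -> Iterable[Record]:
    for row in csv.DictReader(io.StringIO(text)):
        yield {"code": row["code"], "amount": float(row["amount"])}

@parser("xml")
def parse_xml(text: str) -> Iterable[Record]:
    for item in ET.fromstring(text).iter("entry"):  # element name is assumed
        yield {"code": item.get("code"), "amount": float(item.get("amount"))}

def ingest(fmt: str, text: str) -> list:
    """Look up the right parser for a municipality's feed and homogenize it."""
    return list(PARSERS[fmt](text))

# Example: a tiny CSV budget extract.
print(ingest("csv", "code,amount\n00.6011,1250.50\n"))
```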
Figure 6: Architecture of the Second SC6 Pilot
7.3 Architecture

To satisfy the requirements above, the following modules will be deployed:

Storage infrastructures:
- HDFS for storing ingested datasets
- 4store for storing homogenized statistical data and dataset metadata

Processing infrastructures:
- Metadata extraction: Spark is used to extract RDF data and metadata from budget data. These tools will react to Kafka messages.
- PoolParty: a SKOS thesaurus [13] will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g., bio terms, locations, and other named entities). For this step the SWC PoolParty Semantic Suite [14] will be used as an external service. PoolParty is accessible from the BDE components via an HTTP API; the connection between Spark and PoolParty has been implemented in the first pilot cycle. Additional enrichment of the dataset will be explored, e.g., via linking to DBpedia or other LOD sources.
- Data analysis, performed on demand by pre-defined queries in the dashboard.

Other modules:
- Flume for dataset ingestion: for every source that will be ingested into the system, there will be a Flume agent responsible for data ingestion and basic modification/unification.
- Kafka: as soon as a new record is available, a Kafka message is produced; one Kafka consumer stores the raw data into HDFS.
- A set of pre-defined SPARQL queries that carry out analytical aggregations, important comparisons, and/or other analyses of the data (see the query sketch after this list).
- A GUI providing functionality for (a) metadata search, to discover datasets, data, and publications, and (b) linked data browsing (i.e., dereferencing entity descriptions in RDF), in the form of a visual dashboard realised in d3.js [15].
- GraphSearch as the user interface.

[13] Cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System
[14] Cf. http://www.poolparty.biz
[15] Cf. https://d3js.org
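As an illustration of such a pre-defined query, a Data-Cube-style aggregation could be issued from Python with SPARQLWrapper. The endpoint URL, graph layout, and the ex: properties below are hypothetical stand-ins for the pilot's actual RDF Data Cube schema.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical SPARQL endpoint exposed by the 4store instance.
sparql = SPARQLWrapper("http://localhost:8080/sparql/")
sparql.setReturnFormat(JSON)

# Sum budget amounts per municipality; qb:Observation is the standard
# Data Cube class, while the ex: properties are assumed for illustration.
sparql.setQuery("""
    PREFIX qb: <http://purl.org/linked-data/cube#>
    PREFIX ex: <http://example.org/budget#>
    SELECT ?municipality (SUM(?amount) AS ?total)
    WHERE {
      ?obs a qb:Observation ;
           ex:municipality ?municipality ;
           ex:amount ?amount .
    }
    GROUP BY ?municipality
""")

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["municipality"]["value"], row["total"]["value"])
```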
7.4 Deployment

Table 12 lists the components provided to the pilot as part of BDI, and components that will be developed within WP6 in the context of executing the pilot.
Table 12: Components needed to deploy the Second SC6 Pilot

Module | Task | Responsible
Spark over HDFS, 4store, Flume, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, SWC
Data storage schema | To be extended for the pilot | SWC
Metadata extraction | Parsers for different data sources will be developed for the pilot | SWC
GraphSearch GUI | To be configured for the pilot | SWC
8 Second SC7 Pilot Deployment

8.1 Use cases

The pilot is carried out by SatCen, UoA, and NCSR-D in the frame of SC7: Secure societies – Protecting freedom and security of Europe and its citizens.

The pilot demonstrates the following workflows:

1. Event detection workflow: News sites and social media are monitored and processed in order to extract and localize information about events. Events are categorized and the information in them is extracted; the end-user is notified about the area affected by the news and can visualize the event information together with the changes detected by the other workflow (if activated).

2. Change detection workflow: The end user selects a relevant Area of Interest. With respect to the selected dates, two satellite images (earliest and latest) of these areas are downloaded from the ESA Sentinels Scientific Data Hub and processed in order to detect changes. The end-user is notified about detected changes and can view the images and event information about this area.

The second cycle of the SC7 pilot will extend the functionality and improve the performance of the first cycle of the pilot (cf. D5.2, Section 8). Apart from the datasets used in the first cycle of the pilot, this cycle will also use the keyword-based Twitter API to retrieve tweets based on pre-defined keywords. To further support the keyword-based search, the second cycle of the pilot will also include a full-text indexing engine.

The following outputs are made available for visualization or further processing:
- Relevant news related to specific keywords, together with the corresponding Area of Interest
- Detected changes

Moreover, the event detection workflow will be extended in order to automatically activate the change detection workflow. These changes are depicted in the updated architecture diagram in Figure 7.
8.2 Requirements

Table 13 lists the ingestion, storage, processing, and output requirements set by the second cycle of the pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 13 lists only the new requirements.
Table 13: Requirements of the Second SC7 Pilot

R1: Monitor keyword-based text services (Twitter). Text is retrieved and stored together with provenance and any metadata provided by the service (notably location).
Comment: The NOMAD data connectors to Twitter and Reuters will be adapted to access the keyword search API of Twitter and store to Cassandra.

R2: Regularly execute event detection using Spark over the most recent text batch.
Comment: Event detection is part of the ingestion process and adds annotations to the text data; it is not part of the distributed processing (see the batch sketch after this table).

R3: Improve the speed of the change detection workflow.
Comment: Optimize the scalability of the operators developed in Apache Spark for the change detection workflow.

R4: Extend the change detection workflow to improve accuracy.
Comment: Fundamental SNAP operators (e.g., Subset and Terrain Correction) for Sentinel-1 will be adapted to Apache Spark.

R5: Areas of Interest are automatically defined by event detection.
Comment: The Sentinel data connector is parametrized from the event detection module with a GIS shape.

R6: The end-user interface is based on Sextant.
Comment: Improvement of Sextant functionalities to improve the user experience.

R7: Users must be authenticated and authorized to access the pilot data.
Comment: Sextant will be extended in order to support authentication and authorization.
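A rough sketch of the R2 pattern (regular Spark batches over the most recent texts), assuming PySpark and a line-delimited JSON batch of tweets with hypothetical text and created_at fields; the keyword list and the simple keyword rule are placeholders for the pilot's actual event detection algorithm.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("event-detection-batch").getOrCreate()

# Hypothetical batch of recently ingested texts (one JSON object per line).
tweets = spark.read.json("hdfs:///ingest/tweets/latest/")

# Placeholder event rule: flag texts mentioning any monitored keyword.
keywords = ["flood", "wildfire", "earthquake"]
pattern = "(?i)" + "|".join(keywords)

events = (
    tweets
    .filter(F.col("text").rlike(pattern))               # keyword match
    .withColumn("matched", F.regexp_extract("text", pattern, 0))
    .groupBy("matched")                                  # one row per keyword
    .agg(F.count("*").alias("mentions"))
)
events.show()
```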
Figure 7: Architecture of the Second SC7 Pilot
8.3 Architecture

To satisfy the requirements above, the following modules will be deployed:

Storage infrastructures:
- HDFS for storing satellite images
- Cassandra for storing news and tweet content and metadata (see the storage sketch after this list)
- Lucene for storing the GADM dataset, i.e., the administrative areas together with their geo-locations
- Strabon for storing the geo-locations of detected changes and location metadata about news and tweets

Processing infrastructures:
- Spark will be made available for improving the change detection module and developing the event detection module

Data integration:
- Semagrow will federate Strabon and Cassandra to provide the user interface with homogeneous access to both data stores

Other modules:
- Twitter data connector
- Reuters RSS feed reader
- The Sentinel Data Aggregator, which receives as input the set of Areas of Interest and submits a suitable query to the Sentinels Scientific Data Hub
- Sextant as the user interface
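To illustrate the Cassandra side of the keyword-tweet storage, here is a minimal sketch with the DataStax cassandra-driver. The keyspace, table, and column names are hypothetical, and the store() call below is fed a fabricated tweet rather than output from the actual adapted Twitter connector.

```python
import uuid
from cassandra.cluster import Cluster

# Connect to the pilot's Cassandra node (address and keyspace are assumptions).
session = Cluster(["127.0.0.1"]).connect("sc7_pilot")

# Hypothetical table keyed by search keyword, clustered by tweet id.
session.execute("""
    CREATE TABLE IF NOT EXISTS tweets_by_keyword (
        keyword text, tweet_id uuid, body text, lat double, lon double,
        PRIMARY KEY (keyword, tweet_id)
    )
""")

insert = session.prepare(
    "INSERT INTO tweets_by_keyword (keyword, tweet_id, body, lat, lon) "
    "VALUES (?, ?, ?, ?, ?)"
)

def store(keyword: str, body: str, lat: float, lon: float) -> None:
    """Store one retrieved tweet together with its location metadata."""
    session.execute(insert, (keyword, uuid.uuid4(), body, lat, lon))

# Example usage with a fabricated tweet.
store("flood", "River levels rising fast near the bridge", 40.64, 22.94)
```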
8.4 Deployment

Table 14 lists the components provided to the pilot as part of BDI [16], and components that will be developed within WP6 in the context of executing the pilot.
Table 14: Components needed to deploy the Second SC7 Pilot

Module | Task | Responsible
Big Data Integrator: HDFS/Hadoop, Cassandra, Spark, Semagrow, Strabon, SOLR | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, UoA, SWC
Cassandra and Strabon stores | The schema needs to be altered to support tweets by keyword | NCSR-D and UoA
Change detection module | Spark code to be developed, extending and improving the change detection algorithm | UoA
Event detection module | Spark code to be developed to scale the event detection algorithm | NCSR-D
Twitter data connector | To be extended to access the keyword search Twitter API | NCSR-D
User interface | To be enhanced for the pilot | UoA

[16] Cf. https://github.com/big-data-europe/README/wiki/Components
9 Conclusions

This report analysed the pilot requirements and specified the components of the generic Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting round. The relevant work in this task is to ensure that the components are within the scope of what is prepared in WP4, and that they interoperate and can be used in the same application.

All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure and provided to the piloting partners as a basis for their piloting applications, which will be developed in WP6. As a result of this preliminary testing and of the interaction between the technical partners and the piloting partners, some of the original pilot descriptions have been refined and fully specified, and their usage of BDI components has been clarified. This ensures that the pilot descriptions are consistent with the first public release of the BDI platform (D4.2) and can be reproduced by interested third parties.

Work in this task (Task 5.2) will proceed as follows:
- During the second pilot deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.
- During the third pilot deployment phase, work in this task will prepare the next version of this document, regarding the BDI instances needed for the third piloting round.
5 Second SC4 Pilot Deployment
51 Use cases
The pilot is carried out by FhG and CERTH in the frame of SC4 Smart Green and Integrated
Transport
The pilot demonstrates how to implement the workflow for ingesting processing and storing
stream and historical traffic data in a distributed environment The pilot demonstrates the
following workflows
The map matching of the Floating Car Data (FCD) stream that is generated by the taxi
fleet The FCD data that represents the position of cabs using latitude and longitude
coordinates must be map matched to the roads on which the cabs are driving in order
to infer the traffic conditions of the roads The map matching is done through an
algorithm using a geographical database and topological rules
The monitoring of the current traffic conditions that consumes the mapped FCD data
and infers the traffic conditions of the roads
The forecasting of future traffic conditions based on a model that is trained from
historical and real-time mapped FCD data
The second pilot is based upon the processing modules developed in the first pilot (cf D52
Section 5) namely the processing modules developed by CERTH to analyze traffic data
classify traffic conditions The second pilot will also develop the newly added workflow of the
traffic forecasting and model training that did not exist during the first pilot cycle
The data sources available for the pilot are
A near-real time stream Floating Car Data (FCD) generated by a fleet of 1200 taxis
containing information about the position speed and direction of the cabs
A historical database of recorded FCD data
A geographical database with information about the road network in Thessaloniki
The results of traffic monitoring and traffic forecasting are saved into a database for querying
statistics and visualizations
52 Requirements
Table 7 lists the ingestion storage processing and output requirements set by this pilot Since
the present pilot cycle is an extension of the first pilot the requirements of the first pilot also
apply Table 13 lists only the new requirements
D54 ndash v 100
Page
25
Table 7 Requirements of the Second SC4 Pilot
Requirement Comment
R1 The pilot will enable the
evaluation of the present and
future traffic conditions (eg
congestion) within temporal
windows
The FCD map matched data are used to determine
the current traffic condition and to make predictions
within different time windows
R2 The traffic predictions will be
saved in a database
Traffic condition and prediction will be used for
queries statistics evaluation of the quality of
predictions visualizations
R3 The pilot can be started in two
configurations single node (for
development and testing) and
cluster (production)
It must be possible to run all the pilot components
in one single node for development and testing
purposes The cluster configuration must provide
cluster of any components messaging system
(Kafka) processing modules (Flink Spark
TensorFlow) storage (Postgres)
Table 7 Requirements of the Second SC4 Pilot
D54 ndash v 100
Page
26
Figure 4 Architecture of the Second SC4 Pilot
Figure 4 Architecture of the Second SC4 Pilot
53 Architecture
The architecture of the pilot has been designed taking into consideration the data sources
mostly streams the processing steps needed and the information that needs to be computed
The pilot will ingest data from a near real-time FCD data stream from cabs and from historical
FCD data The FCD data needs to be preprocessed for map matching before being used for
classificationprediction
Apache Kafka will be used to distribute the computations as it provides a scalable fault
tolerant messaging system The processing of the data streams will be performed within
temporal windows Apache Flink will be used for the map matching algorithm in the same
manner as in the first cycle of the pilot Apache Spark or Tensorflow will be considered as a
platform to implement the traffic forecasting algorithm
The algorithms used for the map matching and classification will be provided using R as
it provides a good support for machine learning algorithms and because it is commonly used
D54 ndash v 100
Page
27
and well known by researchers at CERTH In order to use the R packages in a Flink application
developed in Java the pilot will connect to R server (via Rserve) Recurrent Neural Networks
will be used for the traffic forecasting module
The traffic conditions and prediction computation will be stored in a scalable fault tolerant
database such as Elasticsearch The storage system must support spatial and temporal
indexing
54 Deployment
Table 8 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 8 Components needed to deploy Second SC4 Pilot
Module Task Responsible
PostGIS Elasticsearch
Kafka Flink Spark
TensorFlow
BDI dockers made available by WP4 NCSR-D SWC
TF FhG
A Kafka producer for FCD
data stream (source URL)
and historical data (source
file system)
Develop a Kafka producer to collect
the FCD data as a stream from web
services and from the file system for
the historical data sets and send them
to a Kafka topic
FhG
Kafka brokers Install Kafka to provide a message
broker and the topics
SWC
A Spark application for traffic
forecasting and model
training
Develop a Spark application that
consumes FCD matched data from a
Kafka topic The application will train a
prediction model and write the traffic
predictions to ElasticSearch
FhG
D54 ndash v 100
Page
28
A Kafka consumer for storing
analysis results
Develop a Kafka consumer that stores
the result of the Traffic Classification
and prediction module
FhG
Table 8 Components needed to deploy the Second SC4 Pilot
6 Second SC5 Pilot Deployment
61 Use cases
The pilot is carried out by NCSR-D in the frame of SC5 Climate Action Environment Resource
Efficiency and Raw Materials
The pilot demonstrates the following workflow A (potentially hazardous) substance is released
in the atmosphere that results to increased readings in one or more monitoring stations The
user accesses a user interface provided by the pilot to define the locations of the monitoring
stations as well as a timeseries of the measured values (eg gamma dose rate) The platform
initiates
a weather matching algorithm that is a search for similarity of the current weather and
the pre-computed weather patterns as well as
a dispersion matching algorithm that is a search for similarity of the current substance
dispersion patterns with the precomputed ones
The weather patterns have been extracted in a pre-processing step by clustering weather
conditions recorded in the past while the substance dispersion patterns have been
precomputed by simulating different scenarios of substance release and weather conditions
The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request
The following datasets are involved
NetCDF files from the European Centre for Medium range Weather Forecasting
(ECMWF7)
GRIB files from National Oceanic and Atmospheric Administration (NOAA8)
The following processing will be carried out
The weather clustering algorithm that creates clusters of similar weather conditions
implemented using the BDI platform (see Section 63)
7 httpappsecmwfintdatasets 8 httpswwwncdcnoaagovdata-accessmodel-datamodel-datasetsglobal-forcast-system-gfs
D54 ndash v 100
Page
29
The WRF downscaling that takes as input a low resolution weather and creates a high
resolution weather
The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model
computes dispersion patterns given predominant weather conditions
The following outputs are made available for visualization or further processing
The dispersions produced by DIPCOT
The weather clusters produced by the weather clustering algorithm
62 Requirements
Table 9 lists the ingestion storage processing and output requirements set by this pilot
Table 9 Requirements of Second SC5 Pilot
Requirement Comment
R1 Provide a means of downloading
currentevaluation weather from
ECMWF or alternative services
Data connectorinterface needs to be developed
R2 ECMWF and NOAA datasets are
compatible with the WRF and
DIPCOT naming conventions
A preprocessing WPS normalization step will
perform the necessary transformations and
variable renamings needs to ensure compatibility
R3 Retrieve NetCDF files from HDFS
as input to the weather clustering
algorithm
R4 Dispersion matching will filter on
dispersion values
Relational database will provide indexes on
dispersion values for efficient dispersion search
R5 Dispersion visualization Weather and dispersion matching must produce
output compatible with Sextantrsquos input or Sextant
must be modified to support new input
Table 9 Requirements of the Second SC5 Pilot
D54 ndash v 100
Page
30
Figure 5 Architecture of the Second SC5 Pilot
Figure 5 Architecture of the Second SC5 Pilot
63 Architecture
To satisfy the requirements described above the following components will be deployed
Storage infrastructure
HDFS for storing NetCDF and GRIB files
Postgres for storing dispersions
Processing components
Scilearn-kit or TensorFlow to host the weather clustering algorithm
Other modules
ECMWF and NOAA data connectors
WPS normalization procedure
WRF downscaling component
DIPCOT atmospheric dispersion model
Weather and dispersion matching
Sextant for visualizing the dispersion layer
D54 ndash v 100
Page
31
64 Deployment
Table 10 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 10 Components needed to deploy the Second SC5 Pilot
Module Task Responsible
HDFS Sextant Postgres BDI dockers made available by WP4 TF UoA NCSR-D
Scikit-learn TensorFlow To be developed in the pilot NCSR-D
DIPCOT To be packaged in the pilot NCSR-D
Weather clustering algorithm To be developed in the pilot NCSR-D
Weather matching To be developed in the pilot NCSR-D
Dispersion matching To be developed in the pilot NCSR-D
ECMWF and NOAA data
connector
To be developed in the pilot NCSR-D
Data visualization UI To be developed in the pilot NCSR-D
Table 10 Components needed to deploy the Second SC5 Pilot
D54 ndash v 100
Page
32
7 Second SC6 Pilot Deployment
71 Use cases
The pilot is carried out by NCSR-D and SWC in the frame of SC6 Europe in a changing world
- inclusive innovative and reflective societies
The pilot demonstrates the following workflow Municipality economic data (ie budget and
budget execution data) are ingested at a regular basis (daily weekly and so on) from a series
of locations in a variety of structures and formats are homogenized so that they can be
compared analyzed and visualized in a comprehensible way The data is exposed to users
via a dashboard that exposes searchdiscovery aggregation analysis correlation and
visualization functionalities over structured data The results of the data analysis will be stored
in the infrastructure to avoid carrying out the same processing multiple times
The second cycle of the pilot will extend the first pilot by incorporating different formats by
developing a modular parsing library
The following datasets are involved
Budget execution data of Municipality of Athens
Budget execution data of Municipality of Thessaloniki
Budget execution data of Municipality of Barcelona
The current datasets involved are exposed either as an API or as CSV XML files
Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies
Statistical data will be described in the RDF DataCube12 vocabulary
The following processing is carried out
Data ingestion and homogenization
Aggregation analysis correlation over scientific data
The following outputs are made available for visualization or further processing
Structured information extracted from budget datasets exposed as a SPARQL endpoint
Metadata for dataset searching and discovery
9 Cf httpsjoinupeceuropaeuassetdcat_application_profiledescription 10 Cf httpwwwomgorgspecEDMC-FIBOFND10Beta1indexhtm 11 Cf httpwwwomgorghot-topicsfinancehtm 12 Cf httpswwww3orgTR2014REC-vocab-data-cube-20140116
D54 ndash v 100
Page
33
Aggregation and analysis
72 Requirements
Table 11 lists the ingestion storage processing and output requirements set by this pilot
Table 11 Requirements of the Second SC6 Pilot
Requirement Comment
R1 In case of failures during data
processing users should be able to
recover data produced earlier in the
workflow along with their lineage and
associated metadata
Processing modules should periodically
store intermediate results When starting
up processing modules should check at the
metadata registry if intermediate results are
available
R2 Transform budget data into a
homogenized format using various
parsers
Parsers will be developed for the pilot
taking into account R1
R3 Expose data and metadata through a
SPARQL endpoint
The triple store should be accessed via a
SPARQL endpoint
R4 Intuitive easy-to-use interface for
searching and selecting relevant data
sources The use of the user interface
should be documented so that users
can ease into using it with as little
effort as possible
The GraphSearch UI will be used to create
visualizations from SPARQL queries
Table 11 Requirements of the Second SC6 Pilot
D54 ndash v 100
Page
34
Figure 6 Architecture of the Second SC6 Pilot
Figure 6 Architecture of the Second SC6 Pilot
73 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS for storing ingested datasets
4store for storing homogenized statistical data and dataset metadata
Processing infrastructures
Metadata extraction Spark is used to extract RDF data and metadata from budget
data These tools will react on Kafka messages
PoolParty A SKOS Thesaurus13 will be used to consolidatetranslate (linkmap) the
terms in the ingested documents (eg bio terms locations and other named entities)
For this step the SWC PoolParty Semantic Suite14 will be used as an external service
13 Please cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 14 Please cf httpwwwpoolpartybiz
D54 ndash v 100
Page
35
PoolParty is accessible from the BDE components via an HTTP API The connection
between Spark and PoolParty has been implemented in the first pilot cycle Additional
enrichment of the dataset will be explored eg via linking to DBpedia or other LOD
sources
Data analysis that will be performed on demand by pre-defined queries in the
dashboard
Other modules
Flume for dataset ingestion For every source that will be ingested into the system there
will be a flume agent responsible for data ingestion and basic modificationunification
Kafka as soon as a new record is available a Kafka message will be produced One
kafka consumer stores raw data into HDFS
A set of pre-defined SPARQL queries that carry out analytical aggregations important
comparisons and or other analysis of the data
GUI that provide functionality for (a) metadata searching to discover datasets data and
publications (b) linked data browsing (ie dereferencing entity descriptions in RDF) in
the form of a visual dashboard realised in d3js15
GraphSearch as the user interface
74 Deployment
Table 12 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 12 Components needed to deploy the Second SC6 Pilot
Module Task Responsible
Spark over HDFS 4store
Flume Kafka
BDI dockers made available by WP4 FH TF InfAI
NCSR-D SWC
Data storage schema To be extended for the pilot SWC
Metadata extraction Parsers for different data sources will
be developed for the pilot
SWC
15 Cf httpsd3jsorg
D54 ndash v 100
Page
36
GraphSearch GUI To be configured for the pilot SWC
Table 12 Components needed to deploy the Second SC6 Pilot
8 Second SC7 Pilot Deployment
81 Use cases
The pilot is carried out by SatCen UoA and NCSR-D in the frame of SC7 Secure societies ndash
Protecting freedom and security of Europe and its citizens
The pilot demonstrates the following workflows
1 Event detection workflow News sites and social media are monitored and processed
in order to extract and localize information about events Events are categorized and
the information from them is extracted the end-user is notified about the area interested
by the news and can visualize the events information together with the changes
detected by the other workflow (if activated)
2 Change detection workflow The end user selects a relevant Area of Interest With
respect to the selected dates two satellite images (earliest and latest) of these areas
are downloaded from ESA Sentinels Scientific Data Hub and processed in order to
detect changes The end-user is notified about detected changes and can view the
images and event information about this area
The second cycle of the SC7 pilot will extend the functionality and improve the performance of
the first cycle of the pilot (cf D52 Section 8)
Apart from the datasets used in the first cycle of the pilot this cycle will also use the keyword-
based Twitter API to retrieve tweets based on pre-defined keywords To further support the
keyword-based search the second cycle of the pilot will also include a full-text indexing engine
The following outputs are made available for visualization or further processing
Relevant news related to specific keywords together with the corresponding Area of
Interested
Detected changes
Moreover the event detection workflow will be extended in order to automatically activate the
change detection workflow These changes are depicted in the updated architecture diagram
in Figure 7
D54 ndash v 100
Page
37
82 Requirements
Table 13 lists the ingestion storage processing and output requirements set by the second
cycle of the pilot Since the present pilot cycle is an extension of the first pilot the requirements
of the first pilot also apply Table 13 lists only the new requirements
Table 13 Requirements of the Second SC7 Pilot
Requirement Comment
R1 Monitor keyword-based text services
(Twitter) Text is retrieved and stored
together with provenance and any
metadata provided by the service
(notably location)
The NOMAD data connectors to Twitter
and Reuters will be adapted to access the
keyword search API of Twitter and store to
Cassandra
R2 Regularly execute event detection
using Spark over the most recent text
batch
Event detection is part of the ingestion
process and adds annotations to the text
data not part of the distributed processing
R3 Improve the speed of the change
detection workflow
Optimize the scalability of the operators
developed in Apache Spark for the change
detection workflow
R4 Extend change detection workflow to
improve accuracy
Fundamental SNAP operators (eg Subset
and Terrain Correction) for Sentinel 1 will be
adapted to Apache Spark
R5 Areas of Interest are automatically
defined by event detection
The Sentinel data connector is
parametrized from the event detection
module with a GIS shape
R6 End-user interface is based on Sextant Improvement of Sextant functionalities to
improve the user experience
D54 ndash v 100
Page
38
R7 Users must be authenticated and
authorized to access the pilot data
Sextant will be extended in order to support
authentication and authorization
Table 13 Requirements of the Second SC7 Pilot
Figure 7 Architecture of the Second SC7 Pilot
Figure 7 Architecture of the Second SC7 Pilot
83 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS for storing satellite images
Cassandra for storing news and tweets content and metadata
Lucene for storing GADM dataset ie the administrative areas together with their geo-
locations
D54 ndash v 100
Page
39
Strabon for storing geo-locations of detected changes and location metadata about
news and tweets
Processing infrastructures
Spark will be made available for improving the change detection module and
developing the event detection module
Data integration
Semagrow will federate Strabon and Cassandra to provide the user interface with
homogeneous access to both data stores
Other modules
Twitter data connector
Reuters RSS feed reader
The Sentinel Data Aggregator receives as input the set of areas of interest and submits
a suitable query to the Sentinels Scientific Data Hub
Sextant as the user interface
84 Deployment
Table 14 lists the components provided to the pilot as part of BDI16 and components that will
be developed within WP6 in the context of executing the pilot
Table 14 Components needed to deploy the Second SC7 Pilot
Module Task Responsible
Big Data Integrator
HDFSHadoop Cassandra
Spark Semagrow Strabon
SOLR
BDI dockers made available by WP4 FH TF InfAI
NCSR-D UoA
SwC
Cassandra and Strabon
stores
The schema needs to be altered to
support tweets by keyword
NCSR-D and
UoA
Change detection module Spark code to be developed for UoA
16 Cf httpsgithubcombig-data-europeREADMEwikiComponents
D54 ndash v 100
Page
40
extending and improving the change
detection algorithm
Event Detection module Spark code to be developed to scale
the event detection algorithm
NCSR-D
Twitter data connector To be extended to access the keyword
search Twitter API
NCSR-D
User interface To be enhanced for the pilot UoA
Table 14 Components needed to deploy the Second SC7 Pilot
D54 ndash v 100
Page
41
9 Conclusions This report analysed the pilot requirements and specifies the components of the the generic
Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting
round The relevant work in this task is to ensure that the components are within the scope
of what is prepared in WP4 and that they interoperate and can be used in the same
application
All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure
and provided to the piloting partners as a basis for their piloting applications which will be
developed in WP6 As a result of this preliminary testing and the interaction between the
technical partners and the piloting partners some of the original pilot descriptions have
been refined and fully specified and their usage of BDI components has been clarified This
ensures that the pilot descriptions are consistent with the first public release of the BDI
platform (D42) and can be reproduced by interested third parties
Work in this task (Task 52) will proceed as follows
During the second pilot deployment phase work in this task will follow and document
development of the individual components and test their integration into the platform
During the third pilot deployment phase work in this task will prepare the next version
of this document regarding the BDI instances needed for the third piloting round
D54 ndash v 100
Page
21
42 Requirements
Table 5 lists the ingestion storage processing and output requirements set by this pilot Since
the second cycle of the pilot extends the first pilot some requirements are identical and
therefore omitted from Table 5
Table 5 Requirements of Second SC3 Pilot
Requirement Comment
R1 The online data will be sent (via
OPC) from the intermediate
(local) processing level to BDI
A data connector must be developed that provides
for receiving OPC streams from an OPC-
compatible server
R2 The application should be able
to recover from short outages by
collecting the data transmitted
during the outage from the data
sources
An OPC data connector must be developed that
can retrieve the missing data collected at the
intermediate level from the distributed data
historian systems
R3 Near-realtime execution of
parametrized models to return
operational statistics including
correlation analysis of data
across units
The analysis software should write its results back
into a specified format and data model that is
appropriate input for further analysis
R4 The GUI supports database
querying and data visualization
for the analytics results
The GUI will be able to access files in the format
and data model
Table 5 Requirements of the Second SC3 Pilot
D54 ndash v 100
Page
22
Figure 3 Architecture of the Second SC3 Pilot
Figure 3 Architecture of the Second SC3 Pilot
43 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS that stores binary blobs each holding a temporal slice of the complete data The
slicing parameters are fixed and can be applied at data ingestion time
A Postgres relational database to store the warnings operational statistics and the
output of the analysis The schema will be defined in a later
A Kafka broker that will distribute the continuous stream of CMS to model execution
Processing infrastructures
D54 ndash v 100
Page
23
A processor that operates upon temporal slices of data
A Spark module that orchestrates the application of the processor on slices
A Spark streaming module that operates on the online data
Other modules
A data connector that offers an ingestion endpoint andor can retrieve from remote data
sources using the FTP protocol
A data connector that offers an ingestion endpoint that can retrieve an online stream
using OPC protocol and publish it to a Kafka topic
Data visualization that can visualize the data files stored in HDFS
44 Deployment
Table 6 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 6 Components needed to deploy the Second SC3 Pilot
Module Task Responsible
Spark HDFS Postgres
Kafka
BDI dockers made available by WP4 FH TF InfAI
NCSR-D SWC
Acoustic Emissions DSP To be developed for the pilot CRES
OPC Data connector To be developed for the pilot CRES
Data visualization To be extended for the pilot CRES
Table 6 Components needed to deploy the Second SC3 Pilot
D54 ndash v 100
Page
24
5 Second SC4 Pilot Deployment
51 Use cases
The pilot is carried out by FhG and CERTH in the frame of SC4 Smart Green and Integrated
Transport
The pilot demonstrates how to implement the workflow for ingesting processing and storing
stream and historical traffic data in a distributed environment The pilot demonstrates the
following workflows
The map matching of the Floating Car Data (FCD) stream that is generated by the taxi
fleet The FCD data that represents the position of cabs using latitude and longitude
coordinates must be map matched to the roads on which the cabs are driving in order
to infer the traffic conditions of the roads The map matching is done through an
algorithm using a geographical database and topological rules
The monitoring of the current traffic conditions that consumes the mapped FCD data
and infers the traffic conditions of the roads
The forecasting of future traffic conditions based on a model that is trained from
historical and real-time mapped FCD data
The second pilot is based upon the processing modules developed in the first pilot (cf D52
Section 5) namely the processing modules developed by CERTH to analyze traffic data
classify traffic conditions The second pilot will also develop the newly added workflow of the
traffic forecasting and model training that did not exist during the first pilot cycle
The data sources available for the pilot are
A near-real time stream Floating Car Data (FCD) generated by a fleet of 1200 taxis
containing information about the position speed and direction of the cabs
A historical database of recorded FCD data
A geographical database with information about the road network in Thessaloniki
The results of traffic monitoring and traffic forecasting are saved into a database for querying
statistics and visualizations
52 Requirements
Table 7 lists the ingestion storage processing and output requirements set by this pilot Since
the present pilot cycle is an extension of the first pilot the requirements of the first pilot also
apply Table 13 lists only the new requirements
D54 ndash v 100
Page
25
Table 7 Requirements of the Second SC4 Pilot
Requirement Comment
R1 The pilot will enable the
evaluation of the present and
future traffic conditions (eg
congestion) within temporal
windows
The FCD map matched data are used to determine
the current traffic condition and to make predictions
within different time windows
R2 The traffic predictions will be
saved in a database
Traffic condition and prediction will be used for
queries statistics evaluation of the quality of
predictions visualizations
R3 The pilot can be started in two
configurations single node (for
development and testing) and
cluster (production)
It must be possible to run all the pilot components
in one single node for development and testing
purposes The cluster configuration must provide
cluster of any components messaging system
(Kafka) processing modules (Flink Spark
TensorFlow) storage (Postgres)
Table 7 Requirements of the Second SC4 Pilot
D54 ndash v 100
Page
26
Figure 4 Architecture of the Second SC4 Pilot
Figure 4 Architecture of the Second SC4 Pilot
53 Architecture
The architecture of the pilot has been designed taking into consideration the data sources
mostly streams the processing steps needed and the information that needs to be computed
The pilot will ingest data from a near real-time FCD data stream from cabs and from historical
FCD data The FCD data needs to be preprocessed for map matching before being used for
classificationprediction
Apache Kafka will be used to distribute the computations as it provides a scalable fault
tolerant messaging system The processing of the data streams will be performed within
temporal windows Apache Flink will be used for the map matching algorithm in the same
manner as in the first cycle of the pilot Apache Spark or Tensorflow will be considered as a
platform to implement the traffic forecasting algorithm
The algorithms used for the map matching and classification will be provided using R as
it provides a good support for machine learning algorithms and because it is commonly used
D54 ndash v 100
Page
27
and well known by researchers at CERTH In order to use the R packages in a Flink application
developed in Java the pilot will connect to R server (via Rserve) Recurrent Neural Networks
will be used for the traffic forecasting module
The traffic conditions and prediction computation will be stored in a scalable fault tolerant
database such as Elasticsearch The storage system must support spatial and temporal
indexing
54 Deployment
Table 8 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 8 Components needed to deploy Second SC4 Pilot
Module Task Responsible
PostGIS Elasticsearch
Kafka Flink Spark
TensorFlow
BDI dockers made available by WP4 NCSR-D SWC
TF FhG
A Kafka producer for FCD
data stream (source URL)
and historical data (source
file system)
Develop a Kafka producer to collect
the FCD data as a stream from web
services and from the file system for
the historical data sets and send them
to a Kafka topic
FhG
Kafka brokers Install Kafka to provide a message
broker and the topics
SWC
A Spark application for traffic
forecasting and model
training
Develop a Spark application that
consumes FCD matched data from a
Kafka topic The application will train a
prediction model and write the traffic
predictions to ElasticSearch
FhG
D54 ndash v 100
Page
28
A Kafka consumer for storing
analysis results
Develop a Kafka consumer that stores
the result of the Traffic Classification
and prediction module
FhG
Table 8 Components needed to deploy the Second SC4 Pilot
6 Second SC5 Pilot Deployment
61 Use cases
The pilot is carried out by NCSR-D in the frame of SC5 Climate Action Environment Resource
Efficiency and Raw Materials
The pilot demonstrates the following workflow A (potentially hazardous) substance is released
in the atmosphere that results to increased readings in one or more monitoring stations The
user accesses a user interface provided by the pilot to define the locations of the monitoring
stations as well as a timeseries of the measured values (eg gamma dose rate) The platform
initiates
a weather matching algorithm that is a search for similarity of the current weather and
the pre-computed weather patterns as well as
a dispersion matching algorithm that is a search for similarity of the current substance
dispersion patterns with the precomputed ones
The weather patterns have been extracted in a pre-processing step by clustering weather
conditions recorded in the past while the substance dispersion patterns have been
precomputed by simulating different scenarios of substance release and weather conditions
The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request
The following datasets are involved
NetCDF files from the European Centre for Medium range Weather Forecasting
(ECMWF7)
GRIB files from National Oceanic and Atmospheric Administration (NOAA8)
The following processing will be carried out
The weather clustering algorithm that creates clusters of similar weather conditions
implemented using the BDI platform (see Section 63)
7 httpappsecmwfintdatasets 8 httpswwwncdcnoaagovdata-accessmodel-datamodel-datasetsglobal-forcast-system-gfs
D54 ndash v 100
Page
29
The WRF downscaling that takes as input a low resolution weather and creates a high
resolution weather
The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model
computes dispersion patterns given predominant weather conditions
The following outputs are made available for visualization or further processing
The dispersions produced by DIPCOT
The weather clusters produced by the weather clustering algorithm
62 Requirements
Table 9 lists the ingestion storage processing and output requirements set by this pilot
Table 9: Requirements of the Second SC5 Pilot

Requirement | Comment
R1: Provide a means of downloading current/evaluation weather from ECMWF or alternative services | A data connector/interface needs to be developed
R2: ECMWF and NOAA datasets are compatible with the WRF and DIPCOT naming conventions | A preprocessing WPS normalization step will perform the transformations and variable renamings needed to ensure compatibility
R3: Retrieve NetCDF files from HDFS as input to the weather clustering algorithm | -
R4: Dispersion matching will filter on dispersion values | A relational database will provide indexes on dispersion values for efficient dispersion search (see the sketch after this table)
R5: Dispersion visualization | Weather and dispersion matching must produce output compatible with Sextant's input, or Sextant must be modified to support new input
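A minimal sketch of how R4 could be served by Postgres is shown below, using psycopg2. The table, column and connection details are hypothetical, not the pilot's actual schema.

```python
import psycopg2

conn = psycopg2.connect(host="postgres", dbname="dispersions",
                        user="bde", password="bde")   # hypothetical credentials
cur = conn.cursor()

# Index the column that dispersion matching filters on (names are assumed).
cur.execute("CREATE INDEX IF NOT EXISTS idx_dispersion_value "
            "ON dispersion_cells (value)")
conn.commit()

# Retrieve only the pre-computed dispersions whose values fall in the observed range.
low, high = 1.0e-6, 5.0e-6   # value range observed at the monitoring stations
cur.execute("SELECT scenario_id, cell_id, value FROM dispersion_cells "
            "WHERE value BETWEEN %s AND %s", (low, high))
candidates = cur.fetchall()
```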
Figure 5: Architecture of the Second SC5 Pilot
6.3 Architecture
To satisfy the requirements described above, the following components will be deployed:

Storage infrastructure:
- HDFS for storing NetCDF and GRIB files (retrieval is sketched after this list)
- Postgres for storing dispersions

Processing components:
- Scikit-learn or TensorFlow to host the weather clustering algorithm

Other modules:
- ECMWF and NOAA data connectors
- WPS normalization procedure
- WRF downscaling component
- DIPCOT atmospheric dispersion model
- Weather and dispersion matching
- Sextant for visualizing the dispersion layer
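For R3, retrieving the NetCDF inputs from HDFS could go through the WebHDFS gateway, e.g. with the HdfsCLI Python package; the namenode address and paths below are assumptions.

```python
from hdfs import InsecureClient  # HdfsCLI, talks to the WebHDFS REST gateway

client = InsecureClient("http://namenode:50070", user="hdfs")  # hypothetical namenode

# Stage the NetCDF inputs of the weather clustering algorithm locally.
for name in client.list("/weather/netcdf"):                    # hypothetical HDFS dir
    if name.endswith(".nc"):
        client.download("/weather/netcdf/" + name, "/data/weather/" + name,
                        overwrite=True)
```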
6.4 Deployment

Table 10 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.
Table 10: Components needed to deploy the Second SC5 Pilot

Module | Task | Responsible
HDFS, Sextant, Postgres | BDI dockers made available by WP4 | TF, UoA, NCSR-D
Scikit-learn/TensorFlow | To be developed in the pilot | NCSR-D
DIPCOT | To be packaged in the pilot | NCSR-D
Weather clustering algorithm | To be developed in the pilot | NCSR-D
Weather matching | To be developed in the pilot | NCSR-D
Dispersion matching | To be developed in the pilot | NCSR-D
ECMWF and NOAA data connector | To be developed in the pilot | NCSR-D
Data visualization UI | To be developed in the pilot | NCSR-D
7. Second SC6 Pilot Deployment

7.1 Use cases
The pilot is carried out by NCSR-D and SWC in the frame of SC6: Europe in a changing world - inclusive, innovative and reflective societies.

The pilot demonstrates the following workflow: Municipality economic data (i.e. budget and budget execution data) are ingested on a regular basis (daily, weekly, and so on) from a series of locations and in a variety of structures and formats, and are homogenized so that they can be compared, analyzed and visualized in a comprehensible way. The data is exposed to users via a dashboard that exposes search/discovery, aggregation, analysis, correlation and visualization functionalities over structured data. The results of the data analysis will be stored in the infrastructure to avoid carrying out the same processing multiple times.

The second cycle of the pilot will extend the first pilot by incorporating different formats through a modular parsing library.
The following datasets are involved:
- Budget execution data of the Municipality of Athens
- Budget execution data of the Municipality of Thessaloniki
- Budget execution data of the Municipality of Barcelona

The current datasets are exposed either via an API or as CSV/XML files. Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies. Statistical data will be described in the RDF Data Cube12 vocabulary.
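As an indication of what such DCAT-AP descriptions could look like, the sketch below builds a minimal dataset record with rdflib; the dataset URI and the selection of properties are illustrative, not the pilot's final metadata schema.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

DCAT = Namespace("http://www.w3.org/ns/dcat#")
DCT = Namespace("http://purl.org/dc/terms/")

g = Graph()
g.bind("dcat", DCAT)
g.bind("dct", DCT)

# Hypothetical identifier for one of the ingested datasets.
ds = URIRef("http://data.example.org/dataset/athens-budget-execution-2016")
g.add((ds, RDF.type, DCAT.Dataset))
g.add((ds, DCT.title, Literal("Budget execution data of the Municipality of Athens")))
g.add((ds, DCT.publisher, URIRef("http://data.example.org/org/municipality-of-athens")))
g.add((ds, DCAT.keyword, Literal("budget")))

print(g.serialize(format="turtle").decode())  # older rdflib versions return bytes
```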
The following processing is carried out:
- Data ingestion and homogenization
- Aggregation, analysis, correlation over scientific data

The following outputs are made available for visualization or further processing:
- Structured information extracted from budget datasets, exposed as a SPARQL endpoint
- Metadata for dataset searching and discovery
- Aggregation and analysis

9 Cf. https://joinup.ec.europa.eu/asset/dcat_application_profile/description
10 Cf. http://www.omg.org/spec/EDMC-FIBO/FND/1.0/Beta1/index.htm
11 Cf. http://www.omg.org/hot-topics/finance.htm
12 Cf. https://www.w3.org/TR/2014/REC-vocab-data-cube-20140116/
7.2 Requirements

Table 11 lists the ingestion, storage, processing and output requirements set by this pilot.
Table 11: Requirements of the Second SC6 Pilot

Requirement | Comment
R1: In case of failures during data processing, users should be able to recover data produced earlier in the workflow, along with their lineage and associated metadata | Processing modules should periodically store intermediate results. When starting up, processing modules should check the metadata registry for available intermediate results
R2: Transform budget data into a homogenized format using various parsers | Parsers will be developed for the pilot, taking into account R1
R3: Expose data and metadata through a SPARQL endpoint | The triple store should be accessed via a SPARQL endpoint
R4: Intuitive, easy-to-use interface for searching and selecting relevant data sources. The use of the user interface should be documented so that users can ease into using it with as little effort as possible | The GraphSearch UI will be used to create visualizations from SPARQL queries
Figure 6: Architecture of the Second SC6 Pilot
7.3 Architecture
To satisfy the requirements above, the following modules will be deployed:

Storage infrastructures:
- HDFS for storing ingested datasets
- 4store for storing homogenized statistical data and dataset metadata

Processing infrastructures:
- Metadata extraction: Spark is used to extract RDF data and metadata from budget data. These tools will react to Kafka messages.
- PoolParty: A SKOS Thesaurus13 will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g. bio terms, locations and other named entities). For this step the SWC PoolParty Semantic Suite14 will be used as an external service. PoolParty is accessible from the BDE components via an HTTP API; the connection between Spark and PoolParty has been implemented in the first pilot cycle. Additional enrichment of the dataset will be explored, e.g. via linking to DBpedia or other LOD sources.
- Data analysis that will be performed on demand by pre-defined queries in the dashboard

Other modules:
- Flume for dataset ingestion: for every source that will be ingested into the system there will be a Flume agent responsible for data ingestion and basic modification/unification
- Kafka: as soon as a new record is available, a Kafka message will be produced. One Kafka consumer stores raw data into HDFS.
- A set of pre-defined SPARQL queries that carry out analytical aggregations, important comparisons and/or other analysis of the data (an example query is sketched after this list)
- GUI that provides functionality for (a) metadata searching to discover datasets, data and publications, and (b) linked data browsing (i.e. dereferencing entity descriptions in RDF) in the form of a visual dashboard realised in d3.js15
- GraphSearch as the user interface

13 Cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System
14 Cf. http://www.poolparty.biz
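The following sketch indicates how such a pre-defined aggregation query could be submitted to the triple store with SPARQLWrapper. The endpoint URL and the measure property are hypothetical, not the pilot's actual Data Cube schema.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical 4store endpoint; the property names below are illustrative.
sparql = SPARQLWrapper("http://4store:8080/sparql/")
sparql.setQuery("""
    PREFIX qb: <http://purl.org/linked-data/cube#>
    SELECT ?dataset (SUM(?amount) AS ?total)
    WHERE {
        ?obs a qb:Observation ;
             qb:dataSet ?dataset ;
             <http://data.example.org/measure/amount> ?amount .
    }
    GROUP BY ?dataset
""")
sparql.setReturnFormat(JSON)

# Print the total budgeted amount per dataset.
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["dataset"]["value"], row["total"]["value"])
```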
7.4 Deployment

Table 12 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.
Table 12: Components needed to deploy the Second SC6 Pilot

Module | Task | Responsible
Spark over HDFS, 4store, Flume, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, SWC
Data storage schema | To be extended for the pilot | SWC
Metadata extraction | Parsers for different data sources will be developed for the pilot | SWC
GraphSearch GUI | To be configured for the pilot | SWC

15 Cf. https://d3js.org/
8. Second SC7 Pilot Deployment

8.1 Use cases
The pilot is carried out by SatCen, UoA and NCSR-D in the frame of SC7: Secure societies - Protecting freedom and security of Europe and its citizens.

The pilot demonstrates the following workflows:
1. Event detection workflow: News sites and social media are monitored and processed in order to extract and localize information about events. Events are categorized and the information from them is extracted; the end-user is notified about the area affected by the news and can visualize the event information together with the changes detected by the other workflow (if activated).
2. Change detection workflow: The end user selects a relevant Area of Interest. With respect to the selected dates, two satellite images (earliest and latest) of these areas are downloaded from the ESA Sentinels Scientific Data Hub and processed in order to detect changes. The end-user is notified about detected changes and can view the images and event information about this area.
The second cycle of the SC7 pilot will extend the functionality and improve the performance of the first cycle of the pilot (cf. D5.2, Section 8).

Apart from the datasets used in the first cycle of the pilot, this cycle will also use the keyword-based Twitter API to retrieve tweets based on pre-defined keywords. To further support the keyword-based search, the second cycle of the pilot will also include a full-text indexing engine.

The following outputs are made available for visualization or further processing:
- Relevant news related to specific keywords, together with the corresponding Area of Interest
- Detected changes

Moreover, the event detection workflow will be extended in order to automatically activate the change detection workflow. These changes are depicted in the updated architecture diagram in Figure 7.
8.2 Requirements

Table 13 lists the ingestion, storage, processing and output requirements set by the second cycle of the pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 13 lists only the new requirements.
Table 13: Requirements of the Second SC7 Pilot

Requirement | Comment
R1: Monitor keyword-based text services (Twitter). Text is retrieved and stored together with provenance and any metadata provided by the service (notably location) | The NOMAD data connectors to Twitter and Reuters will be adapted to access the keyword search API of Twitter and store to Cassandra (a sketch follows this table)
R2: Regularly execute event detection using Spark over the most recent text batch | Event detection is part of the ingestion process and adds annotations to the text data, not part of the distributed processing
R3: Improve the speed of the change detection workflow | Optimize the scalability of the operators developed in Apache Spark for the change detection workflow
R4: Extend the change detection workflow to improve accuracy | Fundamental SNAP operators (e.g. Subset and Terrain Correction) for Sentinel-1 will be adapted to Apache Spark
R5: Areas of Interest are automatically defined by event detection | The Sentinel data connector is parametrized from the event detection module with a GIS shape
R6: End-user interface is based on Sextant | Improvement of Sextant functionalities to improve the user experience
R7: Users must be authenticated and authorized to access the pilot data | Sextant will be extended in order to support authentication and authorization
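As a rough sketch of R1, the snippet below polls the keyword search API of Twitter with tweepy (3.x API names) and stores the results in Cassandra with the DataStax driver. The credentials, keywords, keyspace and table schema are all assumptions for illustration.

```python
import tweepy
from cassandra.cluster import Cluster

# Hypothetical credentials; tweepy 3.x exposes keyword search as api.search().
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

session = Cluster(["cassandra"]).connect("sc7")   # keyspace 'sc7' is an assumption

insert = session.prepare(
    "INSERT INTO tweets (id, keyword, created_at, text, geo) VALUES (?, ?, ?, ?, ?)")

for keyword in ["flood", "wildfire", "earthquake"]:   # pre-defined keywords
    for tweet in api.search(q=keyword, count=100):    # keyword search API
        # Store the text together with its provenance metadata (notably location).
        session.execute(insert, (tweet.id, keyword, tweet.created_at,
                                 tweet.text, str(tweet.coordinates)))
```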
Figure 7: Architecture of the Second SC7 Pilot
8.3 Architecture
To satisfy the requirements above, the following modules will be deployed:

Storage infrastructures:
- HDFS for storing satellite images
- Cassandra for storing news and tweets, content and metadata
- Lucene for storing the GADM dataset, i.e. the administrative areas together with their geo-locations
- Strabon for storing geo-locations of detected changes and location metadata about news and tweets

Processing infrastructures:
- Spark will be made available for improving the change detection module and developing the event detection module

Data integration:
- Semagrow will federate Strabon and Cassandra to provide the user interface with homogeneous access to both data stores

Other modules:
- Twitter data connector
- Reuters RSS feed reader
- The Sentinel Data Aggregator, which receives as input the set of areas of interest and submits a suitable query to the Sentinels Scientific Data Hub (a sketch follows this list)
- Sextant as the user interface
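A query of the kind the Sentinel Data Aggregator submits could look as follows with the sentinelsat package; the credentials, AOI file and product filters are illustrative assumptions.

```python
from sentinelsat import SentinelAPI, geojson_to_wkt, read_geojson

# Hypothetical credentials; the hub URL is the public SciHub endpoint.
api = SentinelAPI("user", "password", "https://scihub.copernicus.eu/dhus")

footprint = geojson_to_wkt(read_geojson("aoi.geojson"))  # AOI defined by event detection
products = api.query(footprint,
                     date=("20170101", "20170220"),
                     platformname="Sentinel-1",
                     producttype="GRD")

# Download the earliest and the latest acquisition for change detection.
ordered = sorted(products.items(), key=lambda kv: kv[1]["beginposition"])
for uuid, _ in (ordered[0], ordered[-1]):
    api.download(uuid)
```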
8.4 Deployment

Table 14 lists the components provided to the pilot as part of BDI16 and components that will be developed within WP6 in the context of executing the pilot.

Table 14: Components needed to deploy the Second SC7 Pilot

Module | Task | Responsible
Big Data Integrator: HDFS/Hadoop, Cassandra, Spark, Semagrow, Strabon, SOLR | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, UoA, SWC
Cassandra and Strabon stores | The schema needs to be altered to support tweets by keyword | NCSR-D and UoA
Change detection module | Spark code to be developed for extending and improving the change detection algorithm | UoA
Event detection module | Spark code to be developed to scale the event detection algorithm | NCSR-D
Twitter data connector | To be extended to access the keyword search Twitter API | NCSR-D
User interface | To be enhanced for the pilot | UoA

16 Cf. https://github.com/big-data-europe/README/wiki/Components
9. Conclusions

This report analysed the pilot requirements and specified the components of the generic Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting round. The relevant work in this task is to ensure that the components are within the scope of what is prepared in WP4 and that they interoperate and can be used in the same application.

All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure and provided to the piloting partners as a basis for their piloting applications, which will be developed in WP6. As a result of this preliminary testing and of the interaction between the technical partners and the piloting partners, some of the original pilot descriptions have been refined and fully specified, and their usage of BDI components has been clarified. This ensures that the pilot descriptions are consistent with the first public release of the BDI platform (D4.2) and can be reproduced by interested third parties.

Work in this task (Task 5.2) will proceed as follows:
- During the second pilot deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.
- During the third pilot deployment phase, work in this task will prepare the next version of this document regarding the BDI instances needed for the third piloting round.
D54 ndash v 100
Page
23
A processor that operates upon temporal slices of data
A Spark module that orchestrates the application of the processor on slices
A Spark streaming module that operates on the online data
Other modules
A data connector that offers an ingestion endpoint andor can retrieve from remote data
sources using the FTP protocol
A data connector that offers an ingestion endpoint that can retrieve an online stream
using OPC protocol and publish it to a Kafka topic
Data visualization that can visualize the data files stored in HDFS
44 Deployment
Table 6 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 6 Components needed to deploy the Second SC3 Pilot
Module Task Responsible
Spark HDFS Postgres
Kafka
BDI dockers made available by WP4 FH TF InfAI
NCSR-D SWC
Acoustic Emissions DSP To be developed for the pilot CRES
OPC Data connector To be developed for the pilot CRES
Data visualization To be extended for the pilot CRES
Table 6 Components needed to deploy the Second SC3 Pilot
D54 ndash v 100
Page
24
5 Second SC4 Pilot Deployment
51 Use cases
The pilot is carried out by FhG and CERTH in the frame of SC4 Smart Green and Integrated
Transport
The pilot demonstrates how to implement the workflow for ingesting processing and storing
stream and historical traffic data in a distributed environment The pilot demonstrates the
following workflows
The map matching of the Floating Car Data (FCD) stream that is generated by the taxi
fleet The FCD data that represents the position of cabs using latitude and longitude
coordinates must be map matched to the roads on which the cabs are driving in order
to infer the traffic conditions of the roads The map matching is done through an
algorithm using a geographical database and topological rules
The monitoring of the current traffic conditions that consumes the mapped FCD data
and infers the traffic conditions of the roads
The forecasting of future traffic conditions based on a model that is trained from
historical and real-time mapped FCD data
The second pilot is based upon the processing modules developed in the first pilot (cf D52
Section 5) namely the processing modules developed by CERTH to analyze traffic data
classify traffic conditions The second pilot will also develop the newly added workflow of the
traffic forecasting and model training that did not exist during the first pilot cycle
The data sources available for the pilot are
A near-real time stream Floating Car Data (FCD) generated by a fleet of 1200 taxis
containing information about the position speed and direction of the cabs
A historical database of recorded FCD data
A geographical database with information about the road network in Thessaloniki
The results of traffic monitoring and traffic forecasting are saved into a database for querying
statistics and visualizations
52 Requirements
Table 7 lists the ingestion storage processing and output requirements set by this pilot Since
the present pilot cycle is an extension of the first pilot the requirements of the first pilot also
apply Table 13 lists only the new requirements
D54 ndash v 100
Page
25
Table 7 Requirements of the Second SC4 Pilot
Requirement Comment
R1 The pilot will enable the
evaluation of the present and
future traffic conditions (eg
congestion) within temporal
windows
The FCD map matched data are used to determine
the current traffic condition and to make predictions
within different time windows
R2 The traffic predictions will be
saved in a database
Traffic condition and prediction will be used for
queries statistics evaluation of the quality of
predictions visualizations
R3 The pilot can be started in two
configurations single node (for
development and testing) and
cluster (production)
It must be possible to run all the pilot components
in one single node for development and testing
purposes The cluster configuration must provide
cluster of any components messaging system
(Kafka) processing modules (Flink Spark
TensorFlow) storage (Postgres)
Table 7 Requirements of the Second SC4 Pilot
D54 ndash v 100
Page
26
Figure 4 Architecture of the Second SC4 Pilot
Figure 4 Architecture of the Second SC4 Pilot
53 Architecture
The architecture of the pilot has been designed taking into consideration the data sources
mostly streams the processing steps needed and the information that needs to be computed
The pilot will ingest data from a near real-time FCD data stream from cabs and from historical
FCD data The FCD data needs to be preprocessed for map matching before being used for
classificationprediction
Apache Kafka will be used to distribute the computations as it provides a scalable fault
tolerant messaging system The processing of the data streams will be performed within
temporal windows Apache Flink will be used for the map matching algorithm in the same
manner as in the first cycle of the pilot Apache Spark or Tensorflow will be considered as a
platform to implement the traffic forecasting algorithm
The algorithms used for the map matching and classification will be provided using R as
it provides a good support for machine learning algorithms and because it is commonly used
D54 ndash v 100
Page
27
and well known by researchers at CERTH In order to use the R packages in a Flink application
developed in Java the pilot will connect to R server (via Rserve) Recurrent Neural Networks
will be used for the traffic forecasting module
The traffic conditions and prediction computation will be stored in a scalable fault tolerant
database such as Elasticsearch The storage system must support spatial and temporal
indexing
54 Deployment
Table 8 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 8 Components needed to deploy Second SC4 Pilot
Module Task Responsible
PostGIS Elasticsearch
Kafka Flink Spark
TensorFlow
BDI dockers made available by WP4 NCSR-D SWC
TF FhG
A Kafka producer for FCD
data stream (source URL)
and historical data (source
file system)
Develop a Kafka producer to collect
the FCD data as a stream from web
services and from the file system for
the historical data sets and send them
to a Kafka topic
FhG
Kafka brokers Install Kafka to provide a message
broker and the topics
SWC
A Spark application for traffic
forecasting and model
training
Develop a Spark application that
consumes FCD matched data from a
Kafka topic The application will train a
prediction model and write the traffic
predictions to ElasticSearch
FhG
D54 ndash v 100
Page
28
A Kafka consumer for storing
analysis results
Develop a Kafka consumer that stores
the result of the Traffic Classification
and prediction module
FhG
Table 8 Components needed to deploy the Second SC4 Pilot
6 Second SC5 Pilot Deployment
61 Use cases
The pilot is carried out by NCSR-D in the frame of SC5 Climate Action Environment Resource
Efficiency and Raw Materials
The pilot demonstrates the following workflow A (potentially hazardous) substance is released
in the atmosphere that results to increased readings in one or more monitoring stations The
user accesses a user interface provided by the pilot to define the locations of the monitoring
stations as well as a timeseries of the measured values (eg gamma dose rate) The platform
initiates
a weather matching algorithm that is a search for similarity of the current weather and
the pre-computed weather patterns as well as
a dispersion matching algorithm that is a search for similarity of the current substance
dispersion patterns with the precomputed ones
The weather patterns have been extracted in a pre-processing step by clustering weather
conditions recorded in the past while the substance dispersion patterns have been
precomputed by simulating different scenarios of substance release and weather conditions
The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request
The following datasets are involved
NetCDF files from the European Centre for Medium range Weather Forecasting
(ECMWF7)
GRIB files from National Oceanic and Atmospheric Administration (NOAA8)
The following processing will be carried out
The weather clustering algorithm that creates clusters of similar weather conditions
implemented using the BDI platform (see Section 63)
7 httpappsecmwfintdatasets 8 httpswwwncdcnoaagovdata-accessmodel-datamodel-datasetsglobal-forcast-system-gfs
D54 ndash v 100
Page
29
The WRF downscaling that takes as input a low resolution weather and creates a high
resolution weather
The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model
computes dispersion patterns given predominant weather conditions
The following outputs are made available for visualization or further processing
The dispersions produced by DIPCOT
The weather clusters produced by the weather clustering algorithm
62 Requirements
Table 9 lists the ingestion storage processing and output requirements set by this pilot
Table 9 Requirements of Second SC5 Pilot
Requirement Comment
R1 Provide a means of downloading
currentevaluation weather from
ECMWF or alternative services
Data connectorinterface needs to be developed
R2 ECMWF and NOAA datasets are
compatible with the WRF and
DIPCOT naming conventions
A preprocessing WPS normalization step will
perform the necessary transformations and
variable renamings needs to ensure compatibility
R3 Retrieve NetCDF files from HDFS
as input to the weather clustering
algorithm
R4 Dispersion matching will filter on
dispersion values
Relational database will provide indexes on
dispersion values for efficient dispersion search
R5 Dispersion visualization Weather and dispersion matching must produce
output compatible with Sextantrsquos input or Sextant
must be modified to support new input
Table 9 Requirements of the Second SC5 Pilot
D54 ndash v 100
Page
30
Figure 5 Architecture of the Second SC5 Pilot
Figure 5 Architecture of the Second SC5 Pilot
63 Architecture
To satisfy the requirements described above the following components will be deployed
Storage infrastructure
HDFS for storing NetCDF and GRIB files
Postgres for storing dispersions
Processing components
Scilearn-kit or TensorFlow to host the weather clustering algorithm
Other modules
ECMWF and NOAA data connectors
WPS normalization procedure
WRF downscaling component
DIPCOT atmospheric dispersion model
Weather and dispersion matching
Sextant for visualizing the dispersion layer
D54 ndash v 100
Page
31
64 Deployment
Table 10 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 10 Components needed to deploy the Second SC5 Pilot
Module Task Responsible
HDFS Sextant Postgres BDI dockers made available by WP4 TF UoA NCSR-D
Scikit-learn TensorFlow To be developed in the pilot NCSR-D
DIPCOT To be packaged in the pilot NCSR-D
Weather clustering algorithm To be developed in the pilot NCSR-D
Weather matching To be developed in the pilot NCSR-D
Dispersion matching To be developed in the pilot NCSR-D
ECMWF and NOAA data
connector
To be developed in the pilot NCSR-D
Data visualization UI To be developed in the pilot NCSR-D
Table 10 Components needed to deploy the Second SC5 Pilot
D54 ndash v 100
Page
32
7 Second SC6 Pilot Deployment
71 Use cases
The pilot is carried out by NCSR-D and SWC in the frame of SC6 Europe in a changing world
- inclusive innovative and reflective societies
The pilot demonstrates the following workflow Municipality economic data (ie budget and
budget execution data) are ingested at a regular basis (daily weekly and so on) from a series
of locations in a variety of structures and formats are homogenized so that they can be
compared analyzed and visualized in a comprehensible way The data is exposed to users
via a dashboard that exposes searchdiscovery aggregation analysis correlation and
visualization functionalities over structured data The results of the data analysis will be stored
in the infrastructure to avoid carrying out the same processing multiple times
The second cycle of the pilot will extend the first pilot by incorporating different formats by
developing a modular parsing library
The following datasets are involved
Budget execution data of Municipality of Athens
Budget execution data of Municipality of Thessaloniki
Budget execution data of Municipality of Barcelona
The current datasets involved are exposed either as an API or as CSV XML files
Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies
Statistical data will be described in the RDF DataCube12 vocabulary
The following processing is carried out
Data ingestion and homogenization
Aggregation analysis correlation over scientific data
The following outputs are made available for visualization or further processing
Structured information extracted from budget datasets exposed as a SPARQL endpoint
Metadata for dataset searching and discovery
9 Cf httpsjoinupeceuropaeuassetdcat_application_profiledescription 10 Cf httpwwwomgorgspecEDMC-FIBOFND10Beta1indexhtm 11 Cf httpwwwomgorghot-topicsfinancehtm 12 Cf httpswwww3orgTR2014REC-vocab-data-cube-20140116
D54 ndash v 100
Page
33
Aggregation and analysis
72 Requirements
Table 11 lists the ingestion storage processing and output requirements set by this pilot
Table 11 Requirements of the Second SC6 Pilot
Requirement Comment
R1 In case of failures during data
processing users should be able to
recover data produced earlier in the
workflow along with their lineage and
associated metadata
Processing modules should periodically
store intermediate results When starting
up processing modules should check at the
metadata registry if intermediate results are
available
R2 Transform budget data into a
homogenized format using various
parsers
Parsers will be developed for the pilot
taking into account R1
R3 Expose data and metadata through a
SPARQL endpoint
The triple store should be accessed via a
SPARQL endpoint
R4 Intuitive easy-to-use interface for
searching and selecting relevant data
sources The use of the user interface
should be documented so that users
can ease into using it with as little
effort as possible
The GraphSearch UI will be used to create
visualizations from SPARQL queries
Table 11 Requirements of the Second SC6 Pilot
D54 ndash v 100
Page
34
Figure 6 Architecture of the Second SC6 Pilot
Figure 6 Architecture of the Second SC6 Pilot
73 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS for storing ingested datasets
4store for storing homogenized statistical data and dataset metadata
Processing infrastructures
Metadata extraction Spark is used to extract RDF data and metadata from budget
data These tools will react on Kafka messages
PoolParty A SKOS Thesaurus13 will be used to consolidatetranslate (linkmap) the
terms in the ingested documents (eg bio terms locations and other named entities)
For this step the SWC PoolParty Semantic Suite14 will be used as an external service
13 Please cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 14 Please cf httpwwwpoolpartybiz
D54 ndash v 100
Page
35
PoolParty is accessible from the BDE components via an HTTP API The connection
between Spark and PoolParty has been implemented in the first pilot cycle Additional
enrichment of the dataset will be explored eg via linking to DBpedia or other LOD
sources
Data analysis that will be performed on demand by pre-defined queries in the
dashboard
Other modules
Flume for dataset ingestion For every source that will be ingested into the system there
will be a flume agent responsible for data ingestion and basic modificationunification
Kafka as soon as a new record is available a Kafka message will be produced One
kafka consumer stores raw data into HDFS
A set of pre-defined SPARQL queries that carry out analytical aggregations important
comparisons and or other analysis of the data
GUI that provide functionality for (a) metadata searching to discover datasets data and
publications (b) linked data browsing (ie dereferencing entity descriptions in RDF) in
the form of a visual dashboard realised in d3js15
GraphSearch as the user interface
74 Deployment
Table 12 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 12 Components needed to deploy the Second SC6 Pilot
Module Task Responsible
Spark over HDFS 4store
Flume Kafka
BDI dockers made available by WP4 FH TF InfAI
NCSR-D SWC
Data storage schema To be extended for the pilot SWC
Metadata extraction Parsers for different data sources will
be developed for the pilot
SWC
15 Cf httpsd3jsorg
D54 ndash v 100
Page
36
GraphSearch GUI To be configured for the pilot SWC
Table 12 Components needed to deploy the Second SC6 Pilot
8 Second SC7 Pilot Deployment
81 Use cases
The pilot is carried out by SatCen UoA and NCSR-D in the frame of SC7 Secure societies ndash
Protecting freedom and security of Europe and its citizens
The pilot demonstrates the following workflows:
1. Event detection workflow: News sites and social media are monitored and processed in order to extract and localize information about events. Events are categorized and the information in them is extracted; the end-user is notified about the area concerned by the news and can visualize the event information together with the changes detected by the other workflow (if activated).
2. Change detection workflow: The end user selects a relevant Area of Interest. For the selected dates, two satellite images (earliest and latest) of this area are downloaded from the ESA Sentinels Scientific Data Hub and processed in order to detect changes. The end-user is notified about detected changes and can view the images and event information about this area.
The second cycle of the SC7 pilot will extend the functionality and improve the performance of the first cycle of the pilot (cf. D5.2, Section 8).
Apart from the datasets used in the first cycle of the pilot, this cycle will also use the keyword-based Twitter API to retrieve tweets based on pre-defined keywords (a retrieval sketch is given below). To further support the keyword-based search, the second cycle of the pilot will also include a full-text indexing engine.
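As an illustration of the keyword-based retrieval, the following is a minimal sketch of a connector polling the Twitter search API. The keyword list and bearer token are placeholders; the pilot's actual connector is an adaptation of the NOMAD data connectors (cf. Table 13, R1).

import requests

# Placeholder keyword list and application-auth bearer token.
KEYWORDS = ["flood", "wildfire"]
TOKEN = "..."

def fetch_tweets(keyword):
    # Twitter REST search API (v1.1): recent tweets matching the keyword.
    resp = requests.get(
        "https://api.twitter.com/1.1/search/tweets.json",
        params={"q": keyword, "result_type": "recent", "count": 100},
        headers={"Authorization": "Bearer " + TOKEN})
    resp.raise_for_status()
    for status in resp.json()["statuses"]:
        # Keep the text together with provenance and location metadata (R1).
        yield {"keyword": keyword,
               "tweet_id": status["id_str"],
               "created_at": status["created_at"],
               "text": status["text"],
               "coordinates": status.get("coordinates"),  # often null
               "source": "twitter"}

Each record yielded in this way can then be written to Cassandra by the ingestion chain.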
The following outputs are made available for visualization or further processing:
- Relevant news related to specific keywords, together with the corresponding Area of Interest.
- Detected changes.
Moreover, the event detection workflow will be extended in order to automatically activate the change detection workflow. These changes are depicted in the updated architecture diagram in Figure 7.
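On the full-text indexing side, the sketch below shows how retrieved texts could be indexed and searched, assuming SOLR (the engine listed in Table 14) with the pysolr client; the core name and document fields are placeholders, not the pilot's actual schema.

import pysolr

# Placeholder SOLR core URL; core and field names are illustrative.
solr = pysolr.Solr("http://solr:8983/solr/news", timeout=10)

# Index a retrieved document so that it can be found by keyword search.
solr.add([{"id": "twitter-123456",
           "text": "Flooding reported near the river delta",
           "source": "twitter",
           "published": "2017-02-01T12:00:00Z"}])

# Full-text keyword search over the indexed documents.
for doc in solr.search("text:flood*", rows=10):
    print(doc["id"])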
8.2 Requirements
Table 13 lists the ingestion, storage, processing, and output requirements set by the second cycle of the pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 13 lists only the new requirements.
Table 13: Requirements of the Second SC7 Pilot
Requirement | Comment
R1: Monitor keyword-based text services (Twitter). Text is retrieved and stored together with provenance and any metadata provided by the service (notably location). | The NOMAD data connectors to Twitter and Reuters will be adapted to access the keyword search API of Twitter and to store to Cassandra.
R2: Regularly execute event detection using Spark over the most recent text batch. | Event detection is part of the ingestion process and adds annotations to the text data, not part of the distributed processing.
R3: Improve the speed of the change detection workflow. | Optimize the scalability of the operators developed in Apache Spark for the change detection workflow.
R4: Extend the change detection workflow to improve accuracy. | Fundamental SNAP operators (e.g. Subset and Terrain Correction) for Sentinel 1 will be adapted to Apache Spark.
R5: Areas of Interest are automatically defined by event detection. | The Sentinel data connector is parametrized from the event detection module with a GIS shape.
R6: The end-user interface is based on Sextant. | Improvement of Sextant functionalities to improve the user experience.
R7: Users must be authenticated and authorized to access the pilot data. | Sextant will be extended in order to support authentication and authorization.
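Requirements R1 and R2 imply a Cassandra layout that supports retrieving the most recent texts per keyword. The following is one possible schema, given as a hedged sketch: the keyspace, table, and column names are assumptions, not the schema that NCSR-D and UoA will actually define (cf. Table 14).

from cassandra.cluster import Cluster

# Placeholder contact point and keyspace; both depend on the deployment.
session = Cluster(["cassandra"]).connect("sc7")

# Partition by keyword and cluster by time, so that the most recent batch
# of tweets per keyword can be fetched with a single range query.
session.execute("""
    CREATE TABLE IF NOT EXISTS tweets_by_keyword (
        keyword    text,
        created_at timestamp,
        tweet_id   text,
        body       text,
        lat        double,
        lon        double,
        source     text,
        PRIMARY KEY ((keyword), created_at, tweet_id)
    ) WITH CLUSTERING ORDER BY (created_at DESC, tweet_id ASC)
""")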
Figure 7: Architecture of the Second SC7 Pilot
8.3 Architecture
To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures:
- HDFS for storing satellite images.
- Cassandra for storing news and tweet content and metadata.
- Lucene for storing the GADM dataset, i.e. the administrative areas together with their geo-locations.
- Strabon for storing geo-locations of detected changes and location metadata about news and tweets.

Processing infrastructures:
- Spark will be made available for improving the change detection module and developing the event detection module (a batch skeleton is sketched after this list).

Data integration:
- Semagrow will federate Strabon and Cassandra to provide the user interface with homogeneous access to both data stores.

Other modules:
- Twitter data connector.
- Reuters RSS feed reader.
- The Sentinel Data Aggregator, which receives as input the set of areas of interest and submits a suitable query to the Sentinels Scientific Data Hub.
- Sextant as the user interface.
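To make the regular event detection batch (R2) concrete, the following skeleton loads the most recent texts from Cassandra and runs event detection over them with Spark. It assumes the hypothetical tweets_by_keyword table sketched in Section 8.2 and the Spark Cassandra connector on the classpath; detect_events is a placeholder for the pilot's actual algorithm.

from datetime import datetime, timedelta
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("event-detection-batch").getOrCreate()

# Load the most recent batch of texts from Cassandra and restrict it
# to the last processing window (here: one hour, as an example).
cutoff = datetime.utcnow() - timedelta(hours=1)
texts = (spark.read.format("org.apache.spark.sql.cassandra")
         .options(keyspace="sc7", table="tweets_by_keyword").load()
         .where(F.col("created_at") > cutoff))

# Placeholder for the pilot's event detection algorithm, which annotates
# each text with an event category and a location.
def detect_events(rows):
    for row in rows:
        yield (row.tweet_id, "placeholder-category", row.lat, row.lon)

events = texts.rdd.mapPartitions(detect_events)
events.toDF(["tweet_id", "category", "lat", "lon"]).show()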
8.4 Deployment
Table 14 lists the components provided to the pilot as part of BDI [16] and the components that will be developed within WP6 in the context of executing the pilot.
Table 14: Components needed to deploy the Second SC7 Pilot
Module | Task | Responsible
Big Data Integrator: HDFS/Hadoop, Cassandra, Spark, Semagrow, Strabon, SOLR | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, UoA, SWC
Cassandra and Strabon stores | The schema needs to be altered to support tweets by keyword | NCSR-D and UoA
Change detection module | Spark code to be developed for extending and improving the change detection algorithm | UoA
Event detection module | Spark code to be developed to scale the event detection algorithm | NCSR-D
Twitter data connector | To be extended to access the keyword search Twitter API | NCSR-D
User interface | To be enhanced for the pilot | UoA

[16] Cf. https://github.com/big-data-europe/README/wiki/Components
9 Conclusions
This report analysed the pilot requirements and specified the components of the generic Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting round. The relevant work in this task is to ensure that the components are within the scope of what is prepared in WP4, and that they interoperate and can be used in the same application.
All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure and provided to the piloting partners as a basis for their piloting applications, which will be developed in WP6. As a result of this preliminary testing and the interaction between the technical partners and the piloting partners, some of the original pilot descriptions have been refined and fully specified, and their usage of BDI components has been clarified. This ensures that the pilot descriptions are consistent with the first public release of the BDI platform (D4.2) and can be reproduced by interested third parties.
Work in this task (Task 5.2) will proceed as follows:
- During the second pilot deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.
- During the third pilot deployment phase, work in this task will prepare the next version of this document regarding the BDI instances needed for the third piloting round.
D54 ndash v 100
Page
24
5 Second SC4 Pilot Deployment
51 Use cases
The pilot is carried out by FhG and CERTH in the frame of SC4 Smart Green and Integrated
Transport
The pilot demonstrates how to implement the workflow for ingesting processing and storing
stream and historical traffic data in a distributed environment The pilot demonstrates the
following workflows
The map matching of the Floating Car Data (FCD) stream that is generated by the taxi
fleet The FCD data that represents the position of cabs using latitude and longitude
coordinates must be map matched to the roads on which the cabs are driving in order
to infer the traffic conditions of the roads The map matching is done through an
algorithm using a geographical database and topological rules
The monitoring of the current traffic conditions that consumes the mapped FCD data
and infers the traffic conditions of the roads
The forecasting of future traffic conditions based on a model that is trained from
historical and real-time mapped FCD data
The second pilot is based upon the processing modules developed in the first pilot (cf D52
Section 5) namely the processing modules developed by CERTH to analyze traffic data
classify traffic conditions The second pilot will also develop the newly added workflow of the
traffic forecasting and model training that did not exist during the first pilot cycle
The data sources available for the pilot are
A near-real time stream Floating Car Data (FCD) generated by a fleet of 1200 taxis
containing information about the position speed and direction of the cabs
A historical database of recorded FCD data
A geographical database with information about the road network in Thessaloniki
The results of traffic monitoring and traffic forecasting are saved into a database for querying
statistics and visualizations
52 Requirements
Table 7 lists the ingestion storage processing and output requirements set by this pilot Since
the present pilot cycle is an extension of the first pilot the requirements of the first pilot also
apply Table 13 lists only the new requirements
D54 ndash v 100
Page
25
Table 7 Requirements of the Second SC4 Pilot
Requirement Comment
R1 The pilot will enable the
evaluation of the present and
future traffic conditions (eg
congestion) within temporal
windows
The FCD map matched data are used to determine
the current traffic condition and to make predictions
within different time windows
R2 The traffic predictions will be
saved in a database
Traffic condition and prediction will be used for
queries statistics evaluation of the quality of
predictions visualizations
R3 The pilot can be started in two
configurations single node (for
development and testing) and
cluster (production)
It must be possible to run all the pilot components
in one single node for development and testing
purposes The cluster configuration must provide
cluster of any components messaging system
(Kafka) processing modules (Flink Spark
TensorFlow) storage (Postgres)
Table 7 Requirements of the Second SC4 Pilot
D54 ndash v 100
Page
26
Figure 4 Architecture of the Second SC4 Pilot
Figure 4 Architecture of the Second SC4 Pilot
53 Architecture
The architecture of the pilot has been designed taking into consideration the data sources
mostly streams the processing steps needed and the information that needs to be computed
The pilot will ingest data from a near real-time FCD data stream from cabs and from historical
FCD data The FCD data needs to be preprocessed for map matching before being used for
classificationprediction
Apache Kafka will be used to distribute the computations as it provides a scalable fault
tolerant messaging system The processing of the data streams will be performed within
temporal windows Apache Flink will be used for the map matching algorithm in the same
manner as in the first cycle of the pilot Apache Spark or Tensorflow will be considered as a
platform to implement the traffic forecasting algorithm
The algorithms used for the map matching and classification will be provided using R as
it provides a good support for machine learning algorithms and because it is commonly used
D54 ndash v 100
Page
27
and well known by researchers at CERTH In order to use the R packages in a Flink application
developed in Java the pilot will connect to R server (via Rserve) Recurrent Neural Networks
will be used for the traffic forecasting module
The traffic conditions and prediction computation will be stored in a scalable fault tolerant
database such as Elasticsearch The storage system must support spatial and temporal
indexing
54 Deployment
Table 8 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 8 Components needed to deploy Second SC4 Pilot
Module Task Responsible
PostGIS Elasticsearch
Kafka Flink Spark
TensorFlow
BDI dockers made available by WP4 NCSR-D SWC
TF FhG
A Kafka producer for FCD
data stream (source URL)
and historical data (source
file system)
Develop a Kafka producer to collect
the FCD data as a stream from web
services and from the file system for
the historical data sets and send them
to a Kafka topic
FhG
Kafka brokers Install Kafka to provide a message
broker and the topics
SWC
A Spark application for traffic
forecasting and model
training
Develop a Spark application that
consumes FCD matched data from a
Kafka topic The application will train a
prediction model and write the traffic
predictions to ElasticSearch
FhG
D54 ndash v 100
Page
28
A Kafka consumer for storing
analysis results
Develop a Kafka consumer that stores
the result of the Traffic Classification
and prediction module
FhG
Table 8 Components needed to deploy the Second SC4 Pilot
6 Second SC5 Pilot Deployment
61 Use cases
The pilot is carried out by NCSR-D in the frame of SC5 Climate Action Environment Resource
Efficiency and Raw Materials
The pilot demonstrates the following workflow A (potentially hazardous) substance is released
in the atmosphere that results to increased readings in one or more monitoring stations The
user accesses a user interface provided by the pilot to define the locations of the monitoring
stations as well as a timeseries of the measured values (eg gamma dose rate) The platform
initiates
a weather matching algorithm that is a search for similarity of the current weather and
the pre-computed weather patterns as well as
a dispersion matching algorithm that is a search for similarity of the current substance
dispersion patterns with the precomputed ones
The weather patterns have been extracted in a pre-processing step by clustering weather
conditions recorded in the past while the substance dispersion patterns have been
precomputed by simulating different scenarios of substance release and weather conditions
The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request
The following datasets are involved
NetCDF files from the European Centre for Medium range Weather Forecasting
(ECMWF7)
GRIB files from National Oceanic and Atmospheric Administration (NOAA8)
The following processing will be carried out
The weather clustering algorithm that creates clusters of similar weather conditions
implemented using the BDI platform (see Section 63)
7 httpappsecmwfintdatasets 8 httpswwwncdcnoaagovdata-accessmodel-datamodel-datasetsglobal-forcast-system-gfs
D54 ndash v 100
Page
29
The WRF downscaling that takes as input a low resolution weather and creates a high
resolution weather
The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model
computes dispersion patterns given predominant weather conditions
The following outputs are made available for visualization or further processing
The dispersions produced by DIPCOT
The weather clusters produced by the weather clustering algorithm
62 Requirements
Table 9 lists the ingestion storage processing and output requirements set by this pilot
Table 9 Requirements of Second SC5 Pilot
Requirement Comment
R1 Provide a means of downloading
currentevaluation weather from
ECMWF or alternative services
Data connectorinterface needs to be developed
R2 ECMWF and NOAA datasets are
compatible with the WRF and
DIPCOT naming conventions
A preprocessing WPS normalization step will
perform the necessary transformations and
variable renamings needs to ensure compatibility
R3 Retrieve NetCDF files from HDFS
as input to the weather clustering
algorithm
R4 Dispersion matching will filter on
dispersion values
Relational database will provide indexes on
dispersion values for efficient dispersion search
R5 Dispersion visualization Weather and dispersion matching must produce
output compatible with Sextantrsquos input or Sextant
must be modified to support new input
Table 9 Requirements of the Second SC5 Pilot
D54 ndash v 100
Page
30
Figure 5 Architecture of the Second SC5 Pilot
Figure 5 Architecture of the Second SC5 Pilot
63 Architecture
To satisfy the requirements described above the following components will be deployed
Storage infrastructure
HDFS for storing NetCDF and GRIB files
Postgres for storing dispersions
Processing components
Scilearn-kit or TensorFlow to host the weather clustering algorithm
Other modules
ECMWF and NOAA data connectors
WPS normalization procedure
WRF downscaling component
DIPCOT atmospheric dispersion model
Weather and dispersion matching
Sextant for visualizing the dispersion layer
D54 ndash v 100
Page
31
64 Deployment
Table 10 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 10 Components needed to deploy the Second SC5 Pilot
Module Task Responsible
HDFS Sextant Postgres BDI dockers made available by WP4 TF UoA NCSR-D
Scikit-learn TensorFlow To be developed in the pilot NCSR-D
DIPCOT To be packaged in the pilot NCSR-D
Weather clustering algorithm To be developed in the pilot NCSR-D
Weather matching To be developed in the pilot NCSR-D
Dispersion matching To be developed in the pilot NCSR-D
ECMWF and NOAA data
connector
To be developed in the pilot NCSR-D
Data visualization UI To be developed in the pilot NCSR-D
Table 10 Components needed to deploy the Second SC5 Pilot
D54 ndash v 100
Page
32
7 Second SC6 Pilot Deployment
71 Use cases
The pilot is carried out by NCSR-D and SWC in the frame of SC6 Europe in a changing world
- inclusive innovative and reflective societies
The pilot demonstrates the following workflow Municipality economic data (ie budget and
budget execution data) are ingested at a regular basis (daily weekly and so on) from a series
of locations in a variety of structures and formats are homogenized so that they can be
compared analyzed and visualized in a comprehensible way The data is exposed to users
via a dashboard that exposes searchdiscovery aggregation analysis correlation and
visualization functionalities over structured data The results of the data analysis will be stored
in the infrastructure to avoid carrying out the same processing multiple times
The second cycle of the pilot will extend the first pilot by incorporating different formats by
developing a modular parsing library
The following datasets are involved
Budget execution data of Municipality of Athens
Budget execution data of Municipality of Thessaloniki
Budget execution data of Municipality of Barcelona
The current datasets involved are exposed either as an API or as CSV XML files
Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies
Statistical data will be described in the RDF DataCube12 vocabulary
The following processing is carried out
Data ingestion and homogenization
Aggregation analysis correlation over scientific data
The following outputs are made available for visualization or further processing
Structured information extracted from budget datasets exposed as a SPARQL endpoint
Metadata for dataset searching and discovery
9 Cf httpsjoinupeceuropaeuassetdcat_application_profiledescription 10 Cf httpwwwomgorgspecEDMC-FIBOFND10Beta1indexhtm 11 Cf httpwwwomgorghot-topicsfinancehtm 12 Cf httpswwww3orgTR2014REC-vocab-data-cube-20140116
D54 ndash v 100
Page
33
Aggregation and analysis
72 Requirements
Table 11 lists the ingestion storage processing and output requirements set by this pilot
Table 11 Requirements of the Second SC6 Pilot
Requirement Comment
R1 In case of failures during data
processing users should be able to
recover data produced earlier in the
workflow along with their lineage and
associated metadata
Processing modules should periodically
store intermediate results When starting
up processing modules should check at the
metadata registry if intermediate results are
available
R2 Transform budget data into a
homogenized format using various
parsers
Parsers will be developed for the pilot
taking into account R1
R3 Expose data and metadata through a
SPARQL endpoint
The triple store should be accessed via a
SPARQL endpoint
R4 Intuitive easy-to-use interface for
searching and selecting relevant data
sources The use of the user interface
should be documented so that users
can ease into using it with as little
effort as possible
The GraphSearch UI will be used to create
visualizations from SPARQL queries
Table 11 Requirements of the Second SC6 Pilot
D54 ndash v 100
Page
34
Figure 6 Architecture of the Second SC6 Pilot
Figure 6 Architecture of the Second SC6 Pilot
73 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS for storing ingested datasets
4store for storing homogenized statistical data and dataset metadata
Processing infrastructures
Metadata extraction Spark is used to extract RDF data and metadata from budget
data These tools will react on Kafka messages
PoolParty A SKOS Thesaurus13 will be used to consolidatetranslate (linkmap) the
terms in the ingested documents (eg bio terms locations and other named entities)
For this step the SWC PoolParty Semantic Suite14 will be used as an external service
13 Please cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 14 Please cf httpwwwpoolpartybiz
D54 ndash v 100
Page
35
PoolParty is accessible from the BDE components via an HTTP API The connection
between Spark and PoolParty has been implemented in the first pilot cycle Additional
enrichment of the dataset will be explored eg via linking to DBpedia or other LOD
sources
Data analysis that will be performed on demand by pre-defined queries in the
dashboard
Other modules
Flume for dataset ingestion For every source that will be ingested into the system there
will be a flume agent responsible for data ingestion and basic modificationunification
Kafka as soon as a new record is available a Kafka message will be produced One
kafka consumer stores raw data into HDFS
A set of pre-defined SPARQL queries that carry out analytical aggregations important
comparisons and or other analysis of the data
GUI that provide functionality for (a) metadata searching to discover datasets data and
publications (b) linked data browsing (ie dereferencing entity descriptions in RDF) in
the form of a visual dashboard realised in d3js15
GraphSearch as the user interface
74 Deployment
Table 12 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 12 Components needed to deploy the Second SC6 Pilot
Module Task Responsible
Spark over HDFS 4store
Flume Kafka
BDI dockers made available by WP4 FH TF InfAI
NCSR-D SWC
Data storage schema To be extended for the pilot SWC
Metadata extraction Parsers for different data sources will
be developed for the pilot
SWC
15 Cf httpsd3jsorg
D54 ndash v 100
Page
36
GraphSearch GUI To be configured for the pilot SWC
Table 12 Components needed to deploy the Second SC6 Pilot
8 Second SC7 Pilot Deployment
81 Use cases
The pilot is carried out by SatCen UoA and NCSR-D in the frame of SC7 Secure societies ndash
Protecting freedom and security of Europe and its citizens
The pilot demonstrates the following workflows
1 Event detection workflow News sites and social media are monitored and processed
in order to extract and localize information about events Events are categorized and
the information from them is extracted the end-user is notified about the area interested
by the news and can visualize the events information together with the changes
detected by the other workflow (if activated)
2 Change detection workflow The end user selects a relevant Area of Interest With
respect to the selected dates two satellite images (earliest and latest) of these areas
are downloaded from ESA Sentinels Scientific Data Hub and processed in order to
detect changes The end-user is notified about detected changes and can view the
images and event information about this area
The second cycle of the SC7 pilot will extend the functionality and improve the performance of
the first cycle of the pilot (cf D52 Section 8)
Apart from the datasets used in the first cycle of the pilot this cycle will also use the keyword-
based Twitter API to retrieve tweets based on pre-defined keywords To further support the
keyword-based search the second cycle of the pilot will also include a full-text indexing engine
The following outputs are made available for visualization or further processing
Relevant news related to specific keywords together with the corresponding Area of
Interested
Detected changes
Moreover the event detection workflow will be extended in order to automatically activate the
change detection workflow These changes are depicted in the updated architecture diagram
in Figure 7
D54 ndash v 100
Page
37
82 Requirements
Table 13 lists the ingestion storage processing and output requirements set by the second
cycle of the pilot Since the present pilot cycle is an extension of the first pilot the requirements
of the first pilot also apply Table 13 lists only the new requirements
Table 13 Requirements of the Second SC7 Pilot
Requirement Comment
R1 Monitor keyword-based text services
(Twitter) Text is retrieved and stored
together with provenance and any
metadata provided by the service
(notably location)
The NOMAD data connectors to Twitter
and Reuters will be adapted to access the
keyword search API of Twitter and store to
Cassandra
R2 Regularly execute event detection
using Spark over the most recent text
batch
Event detection is part of the ingestion
process and adds annotations to the text
data not part of the distributed processing
R3 Improve the speed of the change
detection workflow
Optimize the scalability of the operators
developed in Apache Spark for the change
detection workflow
R4 Extend change detection workflow to
improve accuracy
Fundamental SNAP operators (eg Subset
and Terrain Correction) for Sentinel 1 will be
adapted to Apache Spark
R5 Areas of Interest are automatically
defined by event detection
The Sentinel data connector is
parametrized from the event detection
module with a GIS shape
R6 End-user interface is based on Sextant Improvement of Sextant functionalities to
improve the user experience
D54 ndash v 100
Page
38
R7 Users must be authenticated and
authorized to access the pilot data
Sextant will be extended in order to support
authentication and authorization
Table 13 Requirements of the Second SC7 Pilot
Figure 7 Architecture of the Second SC7 Pilot
Figure 7 Architecture of the Second SC7 Pilot
83 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS for storing satellite images
Cassandra for storing news and tweets content and metadata
Lucene for storing GADM dataset ie the administrative areas together with their geo-
locations
D54 ndash v 100
Page
39
Strabon for storing geo-locations of detected changes and location metadata about
news and tweets
Processing infrastructures
Spark will be made available for improving the change detection module and
developing the event detection module
Data integration
Semagrow will federate Strabon and Cassandra to provide the user interface with
homogeneous access to both data stores
Other modules
Twitter data connector
Reuters RSS feed reader
The Sentinel Data Aggregator receives as input the set of areas of interest and submits
a suitable query to the Sentinels Scientific Data Hub
Sextant as the user interface
84 Deployment
Table 14 lists the components provided to the pilot as part of BDI16 and components that will
be developed within WP6 in the context of executing the pilot
Table 14 Components needed to deploy the Second SC7 Pilot
Module Task Responsible
Big Data Integrator
HDFSHadoop Cassandra
Spark Semagrow Strabon
SOLR
BDI dockers made available by WP4 FH TF InfAI
NCSR-D UoA
SwC
Cassandra and Strabon
stores
The schema needs to be altered to
support tweets by keyword
NCSR-D and
UoA
Change detection module Spark code to be developed for UoA
16 Cf httpsgithubcombig-data-europeREADMEwikiComponents
D54 ndash v 100
Page
40
extending and improving the change
detection algorithm
Event Detection module Spark code to be developed to scale
the event detection algorithm
NCSR-D
Twitter data connector To be extended to access the keyword
search Twitter API
NCSR-D
User interface To be enhanced for the pilot UoA
Table 14 Components needed to deploy the Second SC7 Pilot
D54 ndash v 100
Page
41
9 Conclusions This report analysed the pilot requirements and specifies the components of the the generic
Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting
round The relevant work in this task is to ensure that the components are within the scope
of what is prepared in WP4 and that they interoperate and can be used in the same
application
All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure
and provided to the piloting partners as a basis for their piloting applications which will be
developed in WP6 As a result of this preliminary testing and the interaction between the
technical partners and the piloting partners some of the original pilot descriptions have
been refined and fully specified and their usage of BDI components has been clarified This
ensures that the pilot descriptions are consistent with the first public release of the BDI
platform (D42) and can be reproduced by interested third parties
Work in this task (Task 52) will proceed as follows
During the second pilot deployment phase work in this task will follow and document
development of the individual components and test their integration into the platform
During the third pilot deployment phase work in this task will prepare the next version
of this document regarding the BDI instances needed for the third piloting round
D54 ndash v 100
Page
25
Table 7 Requirements of the Second SC4 Pilot
Requirement Comment
R1 The pilot will enable the
evaluation of the present and
future traffic conditions (eg
congestion) within temporal
windows
The FCD map matched data are used to determine
the current traffic condition and to make predictions
within different time windows
R2 The traffic predictions will be
saved in a database
Traffic condition and prediction will be used for
queries statistics evaluation of the quality of
predictions visualizations
R3 The pilot can be started in two
configurations single node (for
development and testing) and
cluster (production)
It must be possible to run all the pilot components
in one single node for development and testing
purposes The cluster configuration must provide
cluster of any components messaging system
(Kafka) processing modules (Flink Spark
TensorFlow) storage (Postgres)
Table 7 Requirements of the Second SC4 Pilot
D54 ndash v 100
Page
26
Figure 4 Architecture of the Second SC4 Pilot
Figure 4 Architecture of the Second SC4 Pilot
53 Architecture
The architecture of the pilot has been designed taking into consideration the data sources
mostly streams the processing steps needed and the information that needs to be computed
The pilot will ingest data from a near real-time FCD data stream from cabs and from historical
FCD data The FCD data needs to be preprocessed for map matching before being used for
classificationprediction
Apache Kafka will be used to distribute the computations as it provides a scalable fault
tolerant messaging system The processing of the data streams will be performed within
temporal windows Apache Flink will be used for the map matching algorithm in the same
manner as in the first cycle of the pilot Apache Spark or Tensorflow will be considered as a
platform to implement the traffic forecasting algorithm
The algorithms used for the map matching and classification will be provided using R as
it provides a good support for machine learning algorithms and because it is commonly used
D54 ndash v 100
Page
27
and well known by researchers at CERTH In order to use the R packages in a Flink application
developed in Java the pilot will connect to R server (via Rserve) Recurrent Neural Networks
will be used for the traffic forecasting module
The traffic conditions and prediction computation will be stored in a scalable fault tolerant
database such as Elasticsearch The storage system must support spatial and temporal
indexing
54 Deployment
Table 8 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 8 Components needed to deploy Second SC4 Pilot
Module Task Responsible
PostGIS Elasticsearch
Kafka Flink Spark
TensorFlow
BDI dockers made available by WP4 NCSR-D SWC
TF FhG
A Kafka producer for FCD
data stream (source URL)
and historical data (source
file system)
Develop a Kafka producer to collect
the FCD data as a stream from web
services and from the file system for
the historical data sets and send them
to a Kafka topic
FhG
Kafka brokers Install Kafka to provide a message
broker and the topics
SWC
A Spark application for traffic
forecasting and model
training
Develop a Spark application that
consumes FCD matched data from a
Kafka topic The application will train a
prediction model and write the traffic
predictions to ElasticSearch
FhG
D54 ndash v 100
Page
28
A Kafka consumer for storing
analysis results
Develop a Kafka consumer that stores
the result of the Traffic Classification
and prediction module
FhG
Table 8 Components needed to deploy the Second SC4 Pilot
6 Second SC5 Pilot Deployment
61 Use cases
The pilot is carried out by NCSR-D in the frame of SC5 Climate Action Environment Resource
Efficiency and Raw Materials
The pilot demonstrates the following workflow A (potentially hazardous) substance is released
in the atmosphere that results to increased readings in one or more monitoring stations The
user accesses a user interface provided by the pilot to define the locations of the monitoring
stations as well as a timeseries of the measured values (eg gamma dose rate) The platform
initiates
a weather matching algorithm that is a search for similarity of the current weather and
the pre-computed weather patterns as well as
a dispersion matching algorithm that is a search for similarity of the current substance
dispersion patterns with the precomputed ones
The weather patterns have been extracted in a pre-processing step by clustering weather
conditions recorded in the past while the substance dispersion patterns have been
precomputed by simulating different scenarios of substance release and weather conditions
The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request
The following datasets are involved
NetCDF files from the European Centre for Medium range Weather Forecasting
(ECMWF7)
GRIB files from National Oceanic and Atmospheric Administration (NOAA8)
The following processing will be carried out
The weather clustering algorithm that creates clusters of similar weather conditions
implemented using the BDI platform (see Section 63)
7 httpappsecmwfintdatasets 8 httpswwwncdcnoaagovdata-accessmodel-datamodel-datasetsglobal-forcast-system-gfs
D54 ndash v 100
Page
29
The WRF downscaling that takes as input a low resolution weather and creates a high
resolution weather
The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model
computes dispersion patterns given predominant weather conditions
The following outputs are made available for visualization or further processing
The dispersions produced by DIPCOT
The weather clusters produced by the weather clustering algorithm
62 Requirements
Table 9 lists the ingestion storage processing and output requirements set by this pilot
Table 9 Requirements of Second SC5 Pilot
Requirement Comment
R1 Provide a means of downloading
currentevaluation weather from
ECMWF or alternative services
Data connectorinterface needs to be developed
R2 ECMWF and NOAA datasets are
compatible with the WRF and
DIPCOT naming conventions
A preprocessing WPS normalization step will
perform the necessary transformations and
variable renamings needs to ensure compatibility
R3 Retrieve NetCDF files from HDFS
as input to the weather clustering
algorithm
R4 Dispersion matching will filter on
dispersion values
Relational database will provide indexes on
dispersion values for efficient dispersion search
R5 Dispersion visualization Weather and dispersion matching must produce
output compatible with Sextantrsquos input or Sextant
must be modified to support new input
Table 9 Requirements of the Second SC5 Pilot
D54 ndash v 100
Page
30
Figure 5 Architecture of the Second SC5 Pilot
Figure 5 Architecture of the Second SC5 Pilot
63 Architecture
To satisfy the requirements described above the following components will be deployed
Storage infrastructure
HDFS for storing NetCDF and GRIB files
Postgres for storing dispersions
Processing components
Scilearn-kit or TensorFlow to host the weather clustering algorithm
Other modules
ECMWF and NOAA data connectors
WPS normalization procedure
WRF downscaling component
DIPCOT atmospheric dispersion model
Weather and dispersion matching
Sextant for visualizing the dispersion layer
D54 ndash v 100
Page
31
64 Deployment
Table 10 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 10 Components needed to deploy the Second SC5 Pilot
Module Task Responsible
HDFS Sextant Postgres BDI dockers made available by WP4 TF UoA NCSR-D
Scikit-learn TensorFlow To be developed in the pilot NCSR-D
DIPCOT To be packaged in the pilot NCSR-D
Weather clustering algorithm To be developed in the pilot NCSR-D
Weather matching To be developed in the pilot NCSR-D
Dispersion matching To be developed in the pilot NCSR-D
ECMWF and NOAA data
connector
To be developed in the pilot NCSR-D
Data visualization UI To be developed in the pilot NCSR-D
Table 10 Components needed to deploy the Second SC5 Pilot
D54 ndash v 100
Page
32
7 Second SC6 Pilot Deployment
71 Use cases
The pilot is carried out by NCSR-D and SWC in the frame of SC6 Europe in a changing world
- inclusive innovative and reflective societies
The pilot demonstrates the following workflow Municipality economic data (ie budget and
budget execution data) are ingested at a regular basis (daily weekly and so on) from a series
of locations in a variety of structures and formats are homogenized so that they can be
compared analyzed and visualized in a comprehensible way The data is exposed to users
via a dashboard that exposes searchdiscovery aggregation analysis correlation and
visualization functionalities over structured data The results of the data analysis will be stored
in the infrastructure to avoid carrying out the same processing multiple times
The second cycle of the pilot will extend the first pilot by incorporating different formats by
developing a modular parsing library
The following datasets are involved
Budget execution data of Municipality of Athens
Budget execution data of Municipality of Thessaloniki
Budget execution data of Municipality of Barcelona
The current datasets involved are exposed either as an API or as CSV XML files
Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies
Statistical data will be described in the RDF DataCube12 vocabulary
The following processing is carried out
Data ingestion and homogenization
Aggregation analysis correlation over scientific data
The following outputs are made available for visualization or further processing
Structured information extracted from budget datasets exposed as a SPARQL endpoint
Metadata for dataset searching and discovery
9 Cf httpsjoinupeceuropaeuassetdcat_application_profiledescription 10 Cf httpwwwomgorgspecEDMC-FIBOFND10Beta1indexhtm 11 Cf httpwwwomgorghot-topicsfinancehtm 12 Cf httpswwww3orgTR2014REC-vocab-data-cube-20140116
D54 ndash v 100
Page
33
Aggregation and analysis
72 Requirements
Table 11 lists the ingestion storage processing and output requirements set by this pilot
Table 11 Requirements of the Second SC6 Pilot
Requirement Comment
R1 In case of failures during data
processing users should be able to
recover data produced earlier in the
workflow along with their lineage and
associated metadata
Processing modules should periodically
store intermediate results When starting
up processing modules should check at the
metadata registry if intermediate results are
available
R2 Transform budget data into a
homogenized format using various
parsers
Parsers will be developed for the pilot
taking into account R1
R3 Expose data and metadata through a
SPARQL endpoint
The triple store should be accessed via a
SPARQL endpoint
R4 Intuitive easy-to-use interface for
searching and selecting relevant data
sources The use of the user interface
should be documented so that users
can ease into using it with as little
effort as possible
The GraphSearch UI will be used to create
visualizations from SPARQL queries
Table 11 Requirements of the Second SC6 Pilot
D54 ndash v 100
Page
34
Figure 6 Architecture of the Second SC6 Pilot
Figure 6 Architecture of the Second SC6 Pilot
73 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS for storing ingested datasets
4store for storing homogenized statistical data and dataset metadata
Processing infrastructures
Metadata extraction Spark is used to extract RDF data and metadata from budget
data These tools will react on Kafka messages
PoolParty A SKOS Thesaurus13 will be used to consolidatetranslate (linkmap) the
terms in the ingested documents (eg bio terms locations and other named entities)
For this step the SWC PoolParty Semantic Suite14 will be used as an external service
13 Please cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 14 Please cf httpwwwpoolpartybiz
D54 ndash v 100
Page
35
PoolParty is accessible from the BDE components via an HTTP API The connection
between Spark and PoolParty has been implemented in the first pilot cycle Additional
enrichment of the dataset will be explored eg via linking to DBpedia or other LOD
sources
Data analysis that will be performed on demand by pre-defined queries in the
dashboard
Other modules
Flume for dataset ingestion For every source that will be ingested into the system there
will be a flume agent responsible for data ingestion and basic modificationunification
Kafka as soon as a new record is available a Kafka message will be produced One
kafka consumer stores raw data into HDFS
A set of pre-defined SPARQL queries that carry out analytical aggregations important
comparisons and or other analysis of the data
GUI that provide functionality for (a) metadata searching to discover datasets data and
publications (b) linked data browsing (ie dereferencing entity descriptions in RDF) in
the form of a visual dashboard realised in d3js15
GraphSearch as the user interface
74 Deployment
Table 12 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 12 Components needed to deploy the Second SC6 Pilot
Module Task Responsible
Spark over HDFS 4store
Flume Kafka
BDI dockers made available by WP4 FH TF InfAI
NCSR-D SWC
Data storage schema To be extended for the pilot SWC
Metadata extraction Parsers for different data sources will
be developed for the pilot
SWC
15 Cf httpsd3jsorg
D54 ndash v 100
Page
36
GraphSearch GUI To be configured for the pilot SWC
Table 12 Components needed to deploy the Second SC6 Pilot
8 Second SC7 Pilot Deployment
81 Use cases
The pilot is carried out by SatCen UoA and NCSR-D in the frame of SC7 Secure societies ndash
Protecting freedom and security of Europe and its citizens
The pilot demonstrates the following workflows
1 Event detection workflow News sites and social media are monitored and processed
in order to extract and localize information about events Events are categorized and
the information from them is extracted the end-user is notified about the area interested
by the news and can visualize the events information together with the changes
detected by the other workflow (if activated)
2 Change detection workflow The end user selects a relevant Area of Interest With
respect to the selected dates two satellite images (earliest and latest) of these areas
are downloaded from ESA Sentinels Scientific Data Hub and processed in order to
detect changes The end-user is notified about detected changes and can view the
images and event information about this area
The second cycle of the SC7 pilot will extend the functionality and improve the performance of
the first cycle of the pilot (cf D52 Section 8)
Apart from the datasets used in the first cycle of the pilot this cycle will also use the keyword-
based Twitter API to retrieve tweets based on pre-defined keywords To further support the
keyword-based search the second cycle of the pilot will also include a full-text indexing engine
The following outputs are made available for visualization or further processing
Relevant news related to specific keywords together with the corresponding Area of
Interested
Detected changes
Moreover the event detection workflow will be extended in order to automatically activate the
change detection workflow These changes are depicted in the updated architecture diagram
in Figure 7
D54 ndash v 100
Page
37
82 Requirements
Table 13 lists the ingestion storage processing and output requirements set by the second
cycle of the pilot Since the present pilot cycle is an extension of the first pilot the requirements
of the first pilot also apply Table 13 lists only the new requirements
Table 13 Requirements of the Second SC7 Pilot
Requirement Comment
R1 Monitor keyword-based text services
(Twitter) Text is retrieved and stored
together with provenance and any
metadata provided by the service
(notably location)
The NOMAD data connectors to Twitter
and Reuters will be adapted to access the
keyword search API of Twitter and store to
Cassandra
R2 Regularly execute event detection
using Spark over the most recent text
batch
Event detection is part of the ingestion
process and adds annotations to the text
data not part of the distributed processing
R3 Improve the speed of the change
detection workflow
Optimize the scalability of the operators
developed in Apache Spark for the change
detection workflow
R4 Extend change detection workflow to
improve accuracy
Fundamental SNAP operators (eg Subset
and Terrain Correction) for Sentinel 1 will be
adapted to Apache Spark
R5 Areas of Interest are automatically
defined by event detection
The Sentinel data connector is
parametrized from the event detection
module with a GIS shape
R6 End-user interface is based on Sextant Improvement of Sextant functionalities to
improve the user experience
D54 ndash v 100
Page
38
R7 Users must be authenticated and
authorized to access the pilot data
Sextant will be extended in order to support
authentication and authorization
Table 13 Requirements of the Second SC7 Pilot
Figure 7 Architecture of the Second SC7 Pilot
Figure 7 Architecture of the Second SC7 Pilot
83 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS for storing satellite images
Cassandra for storing news and tweets content and metadata
Lucene for storing GADM dataset ie the administrative areas together with their geo-
locations
D54 ndash v 100
Page
39
Strabon for storing geo-locations of detected changes and location metadata about
news and tweets
Processing infrastructures
Spark will be made available for improving the change detection module and
developing the event detection module
Data integration
Semagrow will federate Strabon and Cassandra to provide the user interface with
homogeneous access to both data stores
Other modules
Twitter data connector
Reuters RSS feed reader
The Sentinel Data Aggregator receives as input the set of areas of interest and submits
a suitable query to the Sentinels Scientific Data Hub
Sextant as the user interface
84 Deployment
Table 14 lists the components provided to the pilot as part of BDI16 and components that will
be developed within WP6 in the context of executing the pilot
Table 14 Components needed to deploy the Second SC7 Pilot
Module Task Responsible
Big Data Integrator
HDFSHadoop Cassandra
Spark Semagrow Strabon
SOLR
BDI dockers made available by WP4 FH TF InfAI
NCSR-D UoA
SwC
Cassandra and Strabon
stores
The schema needs to be altered to
support tweets by keyword
NCSR-D and
UoA
Change detection module Spark code to be developed for UoA
16 Cf httpsgithubcombig-data-europeREADMEwikiComponents
D54 ndash v 100
Page
40
extending and improving the change
detection algorithm
Event Detection module Spark code to be developed to scale
the event detection algorithm
NCSR-D
Twitter data connector To be extended to access the keyword
search Twitter API
NCSR-D
User interface To be enhanced for the pilot UoA
Table 14 Components needed to deploy the Second SC7 Pilot
D54 ndash v 100
Page
41
9 Conclusions This report analysed the pilot requirements and specifies the components of the the generic
Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting
round The relevant work in this task is to ensure that the components are within the scope
of what is prepared in WP4 and that they interoperate and can be used in the same
application
All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure
and provided to the piloting partners as a basis for their piloting applications which will be
developed in WP6 As a result of this preliminary testing and the interaction between the
technical partners and the piloting partners some of the original pilot descriptions have
been refined and fully specified and their usage of BDI components has been clarified This
ensures that the pilot descriptions are consistent with the first public release of the BDI
platform (D42) and can be reproduced by interested third parties
Work in this task (Task 52) will proceed as follows
During the second pilot deployment phase work in this task will follow and document
development of the individual components and test their integration into the platform
During the third pilot deployment phase work in this task will prepare the next version
of this document regarding the BDI instances needed for the third piloting round
D54 ndash v 100
Page
26
Figure 4 Architecture of the Second SC4 Pilot
Figure 4 Architecture of the Second SC4 Pilot
53 Architecture
The architecture of the pilot has been designed taking into consideration the data sources
mostly streams the processing steps needed and the information that needs to be computed
The pilot will ingest data from a near real-time FCD data stream from cabs and from historical
FCD data The FCD data needs to be preprocessed for map matching before being used for
classificationprediction
Apache Kafka will be used to distribute the computations as it provides a scalable fault
tolerant messaging system The processing of the data streams will be performed within
temporal windows Apache Flink will be used for the map matching algorithm in the same
manner as in the first cycle of the pilot Apache Spark or Tensorflow will be considered as a
platform to implement the traffic forecasting algorithm
The algorithms used for the map matching and classification will be provided using R as
it provides a good support for machine learning algorithms and because it is commonly used
D54 ndash v 100
Page
27
and well known by researchers at CERTH In order to use the R packages in a Flink application
developed in Java the pilot will connect to R server (via Rserve) Recurrent Neural Networks
will be used for the traffic forecasting module
The traffic conditions and prediction computation will be stored in a scalable fault tolerant
database such as Elasticsearch The storage system must support spatial and temporal
indexing
54 Deployment
Table 8 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 8 Components needed to deploy Second SC4 Pilot
Module Task Responsible
PostGIS Elasticsearch
Kafka Flink Spark
TensorFlow
BDI dockers made available by WP4 NCSR-D SWC
TF FhG
A Kafka producer for FCD
data stream (source URL)
and historical data (source
file system)
Develop a Kafka producer to collect
the FCD data as a stream from web
services and from the file system for
the historical data sets and send them
to a Kafka topic
FhG
Kafka brokers Install Kafka to provide a message
broker and the topics
SWC
A Spark application for traffic
forecasting and model
training
Develop a Spark application that
consumes FCD matched data from a
Kafka topic The application will train a
prediction model and write the traffic
predictions to ElasticSearch
FhG
D54 ndash v 100
Page
28
A Kafka consumer for storing
analysis results
Develop a Kafka consumer that stores
the result of the Traffic Classification
and prediction module
FhG
Table 8 Components needed to deploy the Second SC4 Pilot
6 Second SC5 Pilot Deployment
61 Use cases
The pilot is carried out by NCSR-D in the frame of SC5 Climate Action Environment Resource
Efficiency and Raw Materials
The pilot demonstrates the following workflow A (potentially hazardous) substance is released
in the atmosphere that results to increased readings in one or more monitoring stations The
user accesses a user interface provided by the pilot to define the locations of the monitoring
stations as well as a timeseries of the measured values (eg gamma dose rate) The platform
initiates
a weather matching algorithm that is a search for similarity of the current weather and
the pre-computed weather patterns as well as
a dispersion matching algorithm that is a search for similarity of the current substance
dispersion patterns with the precomputed ones
The weather patterns have been extracted in a pre-processing step by clustering weather
conditions recorded in the past while the substance dispersion patterns have been
precomputed by simulating different scenarios of substance release and weather conditions
The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request
The following datasets are involved
NetCDF files from the European Centre for Medium range Weather Forecasting
(ECMWF7)
GRIB files from National Oceanic and Atmospheric Administration (NOAA8)
The following processing will be carried out
The weather clustering algorithm that creates clusters of similar weather conditions
implemented using the BDI platform (see Section 63)
7 httpappsecmwfintdatasets 8 httpswwwncdcnoaagovdata-accessmodel-datamodel-datasetsglobal-forcast-system-gfs
D54 ndash v 100
Page
29
The WRF downscaling that takes as input a low resolution weather and creates a high
resolution weather
The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model
computes dispersion patterns given predominant weather conditions
The following outputs are made available for visualization or further processing
The dispersions produced by DIPCOT
The weather clusters produced by the weather clustering algorithm
62 Requirements
Table 9 lists the ingestion storage processing and output requirements set by this pilot
Table 9 Requirements of Second SC5 Pilot
Requirement Comment
R1 Provide a means of downloading
currentevaluation weather from
ECMWF or alternative services
Data connectorinterface needs to be developed
R2 ECMWF and NOAA datasets are
compatible with the WRF and
DIPCOT naming conventions
A preprocessing WPS normalization step will
perform the necessary transformations and
variable renamings needs to ensure compatibility
R3 Retrieve NetCDF files from HDFS
as input to the weather clustering
algorithm
R4 Dispersion matching will filter on
dispersion values
Relational database will provide indexes on
dispersion values for efficient dispersion search
R5 Dispersion visualization Weather and dispersion matching must produce
output compatible with Sextantrsquos input or Sextant
must be modified to support new input
Table 9 Requirements of the Second SC5 Pilot
D54 ndash v 100
Page
30
Figure 5 Architecture of the Second SC5 Pilot
Figure 5 Architecture of the Second SC5 Pilot
63 Architecture
To satisfy the requirements described above the following components will be deployed
Storage infrastructure
HDFS for storing NetCDF and GRIB files
Postgres for storing dispersions
Processing components
Scilearn-kit or TensorFlow to host the weather clustering algorithm
Other modules
ECMWF and NOAA data connectors
WPS normalization procedure
WRF downscaling component
DIPCOT atmospheric dispersion model
Weather and dispersion matching
Sextant for visualizing the dispersion layer
D54 ndash v 100
Page
31
64 Deployment
Table 10 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 10 Components needed to deploy the Second SC5 Pilot
Module Task Responsible
HDFS Sextant Postgres BDI dockers made available by WP4 TF UoA NCSR-D
Scikit-learn TensorFlow To be developed in the pilot NCSR-D
DIPCOT To be packaged in the pilot NCSR-D
Weather clustering algorithm To be developed in the pilot NCSR-D
Weather matching To be developed in the pilot NCSR-D
Dispersion matching To be developed in the pilot NCSR-D
ECMWF and NOAA data
connector
To be developed in the pilot NCSR-D
Data visualization UI To be developed in the pilot NCSR-D
Table 10 Components needed to deploy the Second SC5 Pilot
D54 ndash v 100
Page
32
7 Second SC6 Pilot Deployment

7.1 Use cases

The pilot is carried out by NCSR-D and SWC in the frame of SC6, Europe in a changing world - inclusive, innovative and reflective societies.

The pilot demonstrates the following workflow: Municipality economic data (i.e., budget and budget execution data) are ingested on a regular basis (daily, weekly, and so on) from a series of locations in a variety of structures and formats, and are homogenized so that they can be compared, analyzed and visualized in a comprehensible way. The data is exposed to users via a dashboard that offers search/discovery, aggregation, analysis, correlation and visualization functionalities over structured data. The results of the data analysis will be stored in the infrastructure to avoid carrying out the same processing multiple times.

The second cycle of the pilot will extend the first pilot by incorporating different formats, by developing a modular parsing library.

The following datasets are involved:

- Budget execution data of the Municipality of Athens
- Budget execution data of the Municipality of Thessaloniki
- Budget execution data of the Municipality of Barcelona

The current datasets involved are exposed either as an API or as CSV/XML files. Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies. Statistical data will be described in the RDF DataCube12 vocabulary.
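By way of illustration, a minimal RDF DataCube observation can be produced with rdflib as below; the base URI and property names are hypothetical placeholders, not the pilot's actual schema:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

QB = Namespace("http://purl.org/linked-data/cube#")
EX = Namespace("http://example.org/budget/")  # hypothetical base URI

g = Graph()
obs = EX["obs/athens-2016-q1"]
g.add((obs, RDF.type, QB.Observation))
g.add((obs, EX.municipality, Literal("Athens")))
g.add((obs, EX.amount, Literal("1250000.00", datatype=XSD.decimal)))
print(g.serialize(format="turtle"))
```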
The following processing is carried out:

- Data ingestion and homogenization
- Aggregation, analysis and correlation over scientific data

The following outputs are made available for visualization or further processing:

- Structured information extracted from budget datasets, exposed as a SPARQL endpoint
- Metadata for dataset searching and discovery
- Aggregation and analysis

9 Cf. https://joinup.ec.europa.eu/asset/dcat_application_profile/description
10 Cf. http://www.omg.org/spec/EDMC-FIBO/FND/1.0/Beta1/index.htm
11 Cf. http://www.omg.org/hot-topics/finance.htm
12 Cf. https://www.w3.org/TR/2014/REC-vocab-data-cube-20140116/
7.2 Requirements

Table 11 lists the ingestion, storage, processing and output requirements set by this pilot.

Table 11: Requirements of the Second SC6 Pilot

| Requirement | Comment |
| --- | --- |
| R1: In case of failures during data processing, users should be able to recover data produced earlier in the workflow, along with their lineage and associated metadata | Processing modules should periodically store intermediate results. When starting up, processing modules should check at the metadata registry if intermediate results are available |
| R2: Transform budget data into a homogenized format using various parsers | Parsers will be developed for the pilot, taking into account R1 |
| R3: Expose data and metadata through a SPARQL endpoint | The triple store should be accessed via a SPARQL endpoint |
| R4: Intuitive, easy-to-use interface for searching and selecting relevant data sources. The use of the user interface should be documented so that users can ease into using it with as little effort as possible | The GraphSearch UI will be used to create visualizations from SPARQL queries |
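To illustrate R3, the sketch below queries such a SPARQL endpoint with SPARQLWrapper; the endpoint URL and property names are hypothetical:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Endpoint URL and property names are hypothetical placeholders.
sparql = SPARQLWrapper("http://localhost:8080/sparql")
sparql.setQuery("""
    PREFIX qb: <http://purl.org/linked-data/cube#>
    SELECT ?obs ?amount WHERE {
        ?obs a qb:Observation ;
             <http://example.org/budget/amount> ?amount .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["obs"]["value"], row["amount"]["value"])
```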
Figure 6: Architecture of the Second SC6 Pilot
7.3 Architecture

To satisfy the requirements above, the following modules will be deployed:

Storage infrastructures:
- HDFS for storing ingested datasets
- 4store for storing homogenized statistical data and dataset metadata

Processing infrastructures:
- Metadata extraction: Spark is used to extract RDF data and metadata from budget data. These tools will react to Kafka messages.
- PoolParty: A SKOS Thesaurus13 will be used to consolidate/translate (link/map) the terms in the ingested documents (e.g., bio terms, locations and other named entities). For this step the SWC PoolParty Semantic Suite14 will be used as an external service. PoolParty is accessible from the BDE components via an HTTP API. The connection between Spark and PoolParty has been implemented in the first pilot cycle. Additional enrichment of the dataset will be explored, e.g., via linking to DBpedia or other LOD sources.
- Data analysis that will be performed on demand by pre-defined queries in the dashboard.

Other modules:
- Flume for dataset ingestion: For every source that will be ingested into the system there will be a Flume agent responsible for data ingestion and basic modification/unification.
- Kafka: As soon as a new record is available, a Kafka message will be produced. One Kafka consumer stores raw data into HDFS (see the sketch after this list).
- A set of pre-defined SPARQL queries that carry out analytical aggregations, important comparisons and/or other analysis of the data.
- GUI that provides functionality for (a) metadata searching to discover datasets, data and publications, and (b) linked data browsing (i.e., dereferencing entity descriptions in RDF) in the form of a visual dashboard realised in d3.js15.
- GraphSearch as the user interface.

13 Please cf. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System
14 Please cf. http://www.poolparty.biz
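A minimal sketch of the Kafka consumer that stores raw records into HDFS, assuming the kafka-python and hdfs client libraries; the topic name, broker and namenode addresses are assumptions:

```python
from kafka import KafkaConsumer
from hdfs import InsecureClient

# Broker, topic and namenode addresses are assumptions.
consumer = KafkaConsumer("budget-records", bootstrap_servers="kafka:9092")
client = InsecureClient("http://namenode:50070", user="hdfs")

for message in consumer:
    # One HDFS file per raw record, keyed by partition and offset.
    path = "/raw/budget/%d-%d.json" % (message.partition, message.offset)
    client.write(path, data=message.value, overwrite=True)
```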
7.4 Deployment

Table 12 lists the components provided to the pilot as part of BDI and components that will be developed within WP6 in the context of executing the pilot.

Table 12: Components needed to deploy the Second SC6 Pilot

| Module | Task | Responsible |
| --- | --- | --- |
| Spark over HDFS, 4store, Flume, Kafka | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, SWC |
| Data storage schema | To be extended for the pilot | SWC |
| Metadata extraction | Parsers for different data sources will be developed for the pilot | SWC |
| GraphSearch GUI | To be configured for the pilot | SWC |

15 Cf. https://d3js.org
8 Second SC7 Pilot Deployment

8.1 Use cases

The pilot is carried out by SatCen, UoA and NCSR-D in the frame of SC7, Secure societies - protecting freedom and security of Europe and its citizens.

The pilot demonstrates the following workflows:

1. Event detection workflow: News sites and social media are monitored and processed in order to extract and localize information about events. Events are categorized and the information from them is extracted; the end-user is notified about the area affected by the news and can visualize the event information together with the changes detected by the other workflow (if activated).

2. Change detection workflow: The end user selects a relevant Area of Interest. With respect to the selected dates, two satellite images (earliest and latest) of these areas are downloaded from the ESA Sentinels Scientific Data Hub and processed in order to detect changes. The end-user is notified about detected changes and can view the images and event information about this area.

The second cycle of the SC7 pilot will extend the functionality and improve the performance of the first cycle of the pilot (cf. D5.2, Section 8).

Apart from the datasets used in the first cycle of the pilot, this cycle will also use the keyword-based Twitter API to retrieve tweets based on pre-defined keywords. To further support the keyword-based search, the second cycle of the pilot will also include a full-text indexing engine.
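Since the deployment table for this pilot lists SOLR, the full-text engine could be exercised along the following lines; this is a sketch with the pysolr client, and the core name and document fields are hypothetical:

```python
import pysolr

# The Solr core name and URL are hypothetical.
solr = pysolr.Solr("http://localhost:8983/solr/tweets", timeout=10)

# Index a retrieved tweet together with its metadata.
solr.add([{"id": "tw-001",
           "text": "Flooding reported near the river delta",
           "location": "unknown"}])

# Keyword search over the indexed text.
for doc in solr.search("flooding"):
    print(doc["id"], doc["text"])
```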
The following outputs are made available for visualization or further processing:

- Relevant news related to specific keywords, together with the corresponding Area of Interest
- Detected changes

Moreover, the event detection workflow will be extended in order to automatically activate the change detection workflow. These changes are depicted in the updated architecture diagram in Figure 7.
8.2 Requirements

Table 13 lists the ingestion, storage, processing and output requirements set by the second cycle of the pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 13 lists only the new requirements.

Table 13: Requirements of the Second SC7 Pilot

| Requirement | Comment |
| --- | --- |
| R1: Monitor keyword-based text services (Twitter). Text is retrieved and stored together with provenance and any metadata provided by the service (notably location) | The NOMAD data connectors to Twitter and Reuters will be adapted to access the keyword search API of Twitter and store to Cassandra |
| R2: Regularly execute event detection using Spark over the most recent text batch | Event detection is part of the ingestion process and adds annotations to the text data, not part of the distributed processing |
| R3: Improve the speed of the change detection workflow | Optimize the scalability of the operators developed in Apache Spark for the change detection workflow |
| R4: Extend the change detection workflow to improve accuracy | Fundamental SNAP operators (e.g., Subset and Terrain Correction) for Sentinel 1 will be adapted to Apache Spark |
| R5: Areas of Interest are automatically defined by event detection | The Sentinel data connector is parametrized from the event detection module with a GIS shape |
| R6: End-user interface is based on Sextant | Improvement of Sextant functionalities to improve the user experience |
| R7: Users must be authenticated and authorized to access the pilot data | Sextant will be extended in order to support authentication and authorization |
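As an illustration of R1, the sketch below retrieves tweets from Twitter's keyword search API and stores them to Cassandra; the bearer token, keyspace and table layout are assumptions, not the pilot's actual connector:

```python
import requests
from cassandra.cluster import Cluster

# The bearer token, keyspace and table layout are assumptions.
resp = requests.get(
    "https://api.twitter.com/1.1/search/tweets.json",
    params={"q": "earthquake", "count": 100},
    headers={"Authorization": "Bearer <token>"})

session = Cluster(["cassandra"]).connect("pilot")
insert = session.prepare(
    "INSERT INTO tweets (id, text, created_at) VALUES (?, ?, ?)")
for tweet in resp.json().get("statuses", []):
    session.execute(insert, (tweet["id_str"], tweet["text"],
                             tweet["created_at"]))
```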
Figure 7: Architecture of the Second SC7 Pilot
8.3 Architecture

To satisfy the requirements above, the following modules will be deployed:

Storage infrastructures:
- HDFS for storing satellite images
- Cassandra for storing news and tweet content and metadata
- Lucene for storing the GADM dataset, i.e., the administrative areas together with their geo-locations
- Strabon for storing geo-locations of detected changes and location metadata about news and tweets

Processing infrastructures:
- Spark will be made available for improving the change detection module and developing the event detection module (see the sketch after this list)

Data integration:
- Semagrow will federate Strabon and Cassandra to provide the user interface with homogeneous access to both data stores

Other modules:
- Twitter data connector
- Reuters RSS feed reader
- The Sentinel Data Aggregator, which receives as input the set of areas of interest and submits a suitable query to the Sentinels Scientific Data Hub
- Sextant as the user interface
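A minimal sketch of distributing change detection over Spark, using tile-wise differencing as a stand-in for the pilot's SNAP-based operators; the tiling and the change score are illustrative assumptions:

```python
import numpy as np
from pyspark import SparkContext

def tile_change(pair):
    # Mean absolute difference between the two acquisition dates
    # serves as a per-tile change score.
    tile_id, (before, after) = pair
    return tile_id, float(np.abs(after - before).mean())

sc = SparkContext(appName="change-detection-sketch")
# Random arrays stand in for co-registered Sentinel image tiles.
tiles = [(i, (np.random.rand(256, 256), np.random.rand(256, 256)))
         for i in range(8)]
scores = sc.parallelize(tiles).map(tile_change).collect()
```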
8.4 Deployment

Table 14 lists the components provided to the pilot as part of BDI16 and components that will be developed within WP6 in the context of executing the pilot.

Table 14: Components needed to deploy the Second SC7 Pilot

| Module | Task | Responsible |
| --- | --- | --- |
| Big Data Integrator: HDFS/Hadoop, Cassandra, Spark, Semagrow, Strabon, SOLR | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, UoA, SWC |
| Cassandra and Strabon stores | The schema needs to be altered to support tweets by keyword | NCSR-D and UoA |
| Change detection module | Spark code to be developed for extending and improving the change detection algorithm | UoA |
| Event detection module | Spark code to be developed to scale the event detection algorithm | NCSR-D |
| Twitter data connector | To be extended to access the keyword search Twitter API | NCSR-D |
| User interface | To be enhanced for the pilot | UoA |

16 Cf. https://github.com/big-data-europe/README/wiki/Components
9 Conclusions

This report analysed the pilot requirements and specified the components of the generic Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting round. The relevant work in this task is to ensure that the components are within the scope of what is prepared in WP4, and that they interoperate and can be used in the same application.

All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure and provided to the piloting partners as a basis for their piloting applications, which will be developed in WP6. As a result of this preliminary testing and the interaction between the technical partners and the piloting partners, some of the original pilot descriptions have been refined and fully specified, and their usage of BDI components has been clarified. This ensures that the pilot descriptions are consistent with the first public release of the BDI platform (D4.2) and can be reproduced by interested third parties.

Work in this task (Task 5.2) will proceed as follows:

- During the second pilot deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.
- During the third pilot deployment phase, work in this task will prepare the next version of this document regarding the BDI instances needed for the third piloting round.
6 Second SC5 Pilot Deployment
61 Use cases
The pilot is carried out by NCSR-D in the frame of SC5 Climate Action Environment Resource
Efficiency and Raw Materials
The pilot demonstrates the following workflow A (potentially hazardous) substance is released
in the atmosphere that results to increased readings in one or more monitoring stations The
user accesses a user interface provided by the pilot to define the locations of the monitoring
stations as well as a timeseries of the measured values (eg gamma dose rate) The platform
initiates
a weather matching algorithm that is a search for similarity of the current weather and
the pre-computed weather patterns as well as
a dispersion matching algorithm that is a search for similarity of the current substance
dispersion patterns with the precomputed ones
The weather patterns have been extracted in a pre-processing step by clustering weather
conditions recorded in the past while the substance dispersion patterns have been
precomputed by simulating different scenarios of substance release and weather conditions
The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request
The following datasets are involved
NetCDF files from the European Centre for Medium range Weather Forecasting
(ECMWF7)
GRIB files from National Oceanic and Atmospheric Administration (NOAA8)
The following processing will be carried out
The weather clustering algorithm that creates clusters of similar weather conditions
implemented using the BDI platform (see Section 63)
7 httpappsecmwfintdatasets 8 httpswwwncdcnoaagovdata-accessmodel-datamodel-datasetsglobal-forcast-system-gfs
D54 ndash v 100
Page
29
The WRF downscaling that takes as input a low resolution weather and creates a high
resolution weather
The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model
computes dispersion patterns given predominant weather conditions
The following outputs are made available for visualization or further processing
The dispersions produced by DIPCOT
The weather clusters produced by the weather clustering algorithm
62 Requirements
Table 9 lists the ingestion storage processing and output requirements set by this pilot
Table 9 Requirements of Second SC5 Pilot
Requirement Comment
R1 Provide a means of downloading
currentevaluation weather from
ECMWF or alternative services
Data connectorinterface needs to be developed
R2 ECMWF and NOAA datasets are
compatible with the WRF and
DIPCOT naming conventions
A preprocessing WPS normalization step will
perform the necessary transformations and
variable renamings needs to ensure compatibility
R3 Retrieve NetCDF files from HDFS
as input to the weather clustering
algorithm
R4 Dispersion matching will filter on
dispersion values
Relational database will provide indexes on
dispersion values for efficient dispersion search
R5 Dispersion visualization Weather and dispersion matching must produce
output compatible with Sextantrsquos input or Sextant
must be modified to support new input
Table 9 Requirements of the Second SC5 Pilot
D54 ndash v 100
Page
30
Figure 5 Architecture of the Second SC5 Pilot
Figure 5 Architecture of the Second SC5 Pilot
63 Architecture
To satisfy the requirements described above the following components will be deployed
Storage infrastructure
HDFS for storing NetCDF and GRIB files
Postgres for storing dispersions
Processing components
Scilearn-kit or TensorFlow to host the weather clustering algorithm
Other modules
ECMWF and NOAA data connectors
WPS normalization procedure
WRF downscaling component
DIPCOT atmospheric dispersion model
Weather and dispersion matching
Sextant for visualizing the dispersion layer
D54 ndash v 100
Page
31
64 Deployment
Table 10 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 10 Components needed to deploy the Second SC5 Pilot
Module Task Responsible
HDFS Sextant Postgres BDI dockers made available by WP4 TF UoA NCSR-D
Scikit-learn TensorFlow To be developed in the pilot NCSR-D
DIPCOT To be packaged in the pilot NCSR-D
Weather clustering algorithm To be developed in the pilot NCSR-D
Weather matching To be developed in the pilot NCSR-D
Dispersion matching To be developed in the pilot NCSR-D
ECMWF and NOAA data
connector
To be developed in the pilot NCSR-D
Data visualization UI To be developed in the pilot NCSR-D
Table 10 Components needed to deploy the Second SC5 Pilot
D54 ndash v 100
Page
32
7 Second SC6 Pilot Deployment
71 Use cases
The pilot is carried out by NCSR-D and SWC in the frame of SC6 Europe in a changing world
- inclusive innovative and reflective societies
The pilot demonstrates the following workflow Municipality economic data (ie budget and
budget execution data) are ingested at a regular basis (daily weekly and so on) from a series
of locations in a variety of structures and formats are homogenized so that they can be
compared analyzed and visualized in a comprehensible way The data is exposed to users
via a dashboard that exposes searchdiscovery aggregation analysis correlation and
visualization functionalities over structured data The results of the data analysis will be stored
in the infrastructure to avoid carrying out the same processing multiple times
The second cycle of the pilot will extend the first pilot by incorporating different formats by
developing a modular parsing library
The following datasets are involved
Budget execution data of Municipality of Athens
Budget execution data of Municipality of Thessaloniki
Budget execution data of Municipality of Barcelona
The current datasets involved are exposed either as an API or as CSV XML files
Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies
Statistical data will be described in the RDF DataCube12 vocabulary
The following processing is carried out
Data ingestion and homogenization
Aggregation analysis correlation over scientific data
The following outputs are made available for visualization or further processing
Structured information extracted from budget datasets exposed as a SPARQL endpoint
Metadata for dataset searching and discovery
9 Cf httpsjoinupeceuropaeuassetdcat_application_profiledescription 10 Cf httpwwwomgorgspecEDMC-FIBOFND10Beta1indexhtm 11 Cf httpwwwomgorghot-topicsfinancehtm 12 Cf httpswwww3orgTR2014REC-vocab-data-cube-20140116
D54 ndash v 100
Page
33
Aggregation and analysis
72 Requirements
Table 11 lists the ingestion storage processing and output requirements set by this pilot
Table 11 Requirements of the Second SC6 Pilot
Requirement Comment
R1 In case of failures during data
processing users should be able to
recover data produced earlier in the
workflow along with their lineage and
associated metadata
Processing modules should periodically
store intermediate results When starting
up processing modules should check at the
metadata registry if intermediate results are
available
R2 Transform budget data into a
homogenized format using various
parsers
Parsers will be developed for the pilot
taking into account R1
R3 Expose data and metadata through a
SPARQL endpoint
The triple store should be accessed via a
SPARQL endpoint
R4 Intuitive easy-to-use interface for
searching and selecting relevant data
sources The use of the user interface
should be documented so that users
can ease into using it with as little
effort as possible
The GraphSearch UI will be used to create
visualizations from SPARQL queries
Table 11 Requirements of the Second SC6 Pilot
D54 ndash v 100
Page
34
Figure 6 Architecture of the Second SC6 Pilot
Figure 6 Architecture of the Second SC6 Pilot
73 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS for storing ingested datasets
4store for storing homogenized statistical data and dataset metadata
Processing infrastructures
Metadata extraction Spark is used to extract RDF data and metadata from budget
data These tools will react on Kafka messages
PoolParty A SKOS Thesaurus13 will be used to consolidatetranslate (linkmap) the
terms in the ingested documents (eg bio terms locations and other named entities)
For this step the SWC PoolParty Semantic Suite14 will be used as an external service
13 Please cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 14 Please cf httpwwwpoolpartybiz
D54 ndash v 100
Page
35
PoolParty is accessible from the BDE components via an HTTP API The connection
between Spark and PoolParty has been implemented in the first pilot cycle Additional
enrichment of the dataset will be explored eg via linking to DBpedia or other LOD
sources
Data analysis that will be performed on demand by pre-defined queries in the
dashboard
Other modules
Flume for dataset ingestion For every source that will be ingested into the system there
will be a flume agent responsible for data ingestion and basic modificationunification
Kafka as soon as a new record is available a Kafka message will be produced One
kafka consumer stores raw data into HDFS
A set of pre-defined SPARQL queries that carry out analytical aggregations important
comparisons and or other analysis of the data
GUI that provide functionality for (a) metadata searching to discover datasets data and
publications (b) linked data browsing (ie dereferencing entity descriptions in RDF) in
the form of a visual dashboard realised in d3js15
GraphSearch as the user interface
74 Deployment
Table 12 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 12 Components needed to deploy the Second SC6 Pilot
Module Task Responsible
Spark over HDFS 4store
Flume Kafka
BDI dockers made available by WP4 FH TF InfAI
NCSR-D SWC
Data storage schema To be extended for the pilot SWC
Metadata extraction Parsers for different data sources will
be developed for the pilot
SWC
15 Cf httpsd3jsorg
D54 ndash v 100
Page
36
GraphSearch GUI To be configured for the pilot SWC
Table 12 Components needed to deploy the Second SC6 Pilot
8 Second SC7 Pilot Deployment
81 Use cases
The pilot is carried out by SatCen UoA and NCSR-D in the frame of SC7 Secure societies ndash
Protecting freedom and security of Europe and its citizens
The pilot demonstrates the following workflows
1 Event detection workflow News sites and social media are monitored and processed
in order to extract and localize information about events Events are categorized and
the information from them is extracted the end-user is notified about the area interested
by the news and can visualize the events information together with the changes
detected by the other workflow (if activated)
2 Change detection workflow The end user selects a relevant Area of Interest With
respect to the selected dates two satellite images (earliest and latest) of these areas
are downloaded from ESA Sentinels Scientific Data Hub and processed in order to
detect changes The end-user is notified about detected changes and can view the
images and event information about this area
The second cycle of the SC7 pilot will extend the functionality and improve the performance of
the first cycle of the pilot (cf D52 Section 8)
Apart from the datasets used in the first cycle of the pilot this cycle will also use the keyword-
based Twitter API to retrieve tweets based on pre-defined keywords To further support the
keyword-based search the second cycle of the pilot will also include a full-text indexing engine
The following outputs are made available for visualization or further processing
Relevant news related to specific keywords together with the corresponding Area of
Interested
Detected changes
Moreover the event detection workflow will be extended in order to automatically activate the
change detection workflow These changes are depicted in the updated architecture diagram
in Figure 7
D54 ndash v 100
Page
37
82 Requirements
Table 13 lists the ingestion storage processing and output requirements set by the second
cycle of the pilot Since the present pilot cycle is an extension of the first pilot the requirements
of the first pilot also apply Table 13 lists only the new requirements
Table 13 Requirements of the Second SC7 Pilot
Requirement Comment
R1 Monitor keyword-based text services
(Twitter) Text is retrieved and stored
together with provenance and any
metadata provided by the service
(notably location)
The NOMAD data connectors to Twitter
and Reuters will be adapted to access the
keyword search API of Twitter and store to
Cassandra
R2 Regularly execute event detection
using Spark over the most recent text
batch
Event detection is part of the ingestion
process and adds annotations to the text
data not part of the distributed processing
R3 Improve the speed of the change
detection workflow
Optimize the scalability of the operators
developed in Apache Spark for the change
detection workflow
R4 Extend change detection workflow to
improve accuracy
Fundamental SNAP operators (eg Subset
and Terrain Correction) for Sentinel 1 will be
adapted to Apache Spark
R5 Areas of Interest are automatically
defined by event detection
The Sentinel data connector is
parametrized from the event detection
module with a GIS shape
R6 End-user interface is based on Sextant Improvement of Sextant functionalities to
improve the user experience
D54 ndash v 100
Page
38
R7 Users must be authenticated and
authorized to access the pilot data
Sextant will be extended in order to support
authentication and authorization
Table 13 Requirements of the Second SC7 Pilot
Figure 7 Architecture of the Second SC7 Pilot
Figure 7 Architecture of the Second SC7 Pilot
83 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS for storing satellite images
Cassandra for storing news and tweets content and metadata
Lucene for storing GADM dataset ie the administrative areas together with their geo-
locations
D54 ndash v 100
Page
39
Strabon for storing geo-locations of detected changes and location metadata about
news and tweets
Processing infrastructures
Spark will be made available for improving the change detection module and
developing the event detection module
Data integration
Semagrow will federate Strabon and Cassandra to provide the user interface with
homogeneous access to both data stores
Other modules
Twitter data connector
Reuters RSS feed reader
The Sentinel Data Aggregator receives as input the set of areas of interest and submits
a suitable query to the Sentinels Scientific Data Hub
Sextant as the user interface
84 Deployment
Table 14 lists the components provided to the pilot as part of BDI16 and components that will
be developed within WP6 in the context of executing the pilot
Table 14 Components needed to deploy the Second SC7 Pilot
Module Task Responsible
Big Data Integrator
HDFSHadoop Cassandra
Spark Semagrow Strabon
SOLR
BDI dockers made available by WP4 FH TF InfAI
NCSR-D UoA
SwC
Cassandra and Strabon
stores
The schema needs to be altered to
support tweets by keyword
NCSR-D and
UoA
Change detection module Spark code to be developed for UoA
16 Cf httpsgithubcombig-data-europeREADMEwikiComponents
D54 ndash v 100
Page
40
extending and improving the change
detection algorithm
Event Detection module Spark code to be developed to scale
the event detection algorithm
NCSR-D
Twitter data connector To be extended to access the keyword
search Twitter API
NCSR-D
User interface To be enhanced for the pilot UoA
Table 14 Components needed to deploy the Second SC7 Pilot
D54 ndash v 100
Page
41
9 Conclusions This report analysed the pilot requirements and specifies the components of the the generic
Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting
round The relevant work in this task is to ensure that the components are within the scope
of what is prepared in WP4 and that they interoperate and can be used in the same
application
All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure
and provided to the piloting partners as a basis for their piloting applications which will be
developed in WP6 As a result of this preliminary testing and the interaction between the
technical partners and the piloting partners some of the original pilot descriptions have
been refined and fully specified and their usage of BDI components has been clarified This
ensures that the pilot descriptions are consistent with the first public release of the BDI
platform (D42) and can be reproduced by interested third parties
Work in this task (Task 52) will proceed as follows
During the second pilot deployment phase work in this task will follow and document
development of the individual components and test their integration into the platform
During the third pilot deployment phase work in this task will prepare the next version
of this document regarding the BDI instances needed for the third piloting round
D54 ndash v 100
Page
28
A Kafka consumer for storing
analysis results
Develop a Kafka consumer that stores
the result of the Traffic Classification
and prediction module
FhG
Table 8 Components needed to deploy the Second SC4 Pilot
6 Second SC5 Pilot Deployment
61 Use cases
The pilot is carried out by NCSR-D in the frame of SC5 Climate Action Environment Resource
Efficiency and Raw Materials
The pilot demonstrates the following workflow A (potentially hazardous) substance is released
in the atmosphere that results to increased readings in one or more monitoring stations The
user accesses a user interface provided by the pilot to define the locations of the monitoring
stations as well as a timeseries of the measured values (eg gamma dose rate) The platform
initiates
a weather matching algorithm that is a search for similarity of the current weather and
the pre-computed weather patterns as well as
a dispersion matching algorithm that is a search for similarity of the current substance
dispersion patterns with the precomputed ones
The weather patterns have been extracted in a pre-processing step by clustering weather
conditions recorded in the past while the substance dispersion patterns have been
precomputed by simulating different scenarios of substance release and weather conditions
The pre-computed patterns are stored in the BDE infrastructure and retrieved upon request
The following datasets are involved
NetCDF files from the European Centre for Medium range Weather Forecasting
(ECMWF7)
GRIB files from National Oceanic and Atmospheric Administration (NOAA8)
The following processing will be carried out
The weather clustering algorithm that creates clusters of similar weather conditions
implemented using the BDI platform (see Section 63)
7 httpappsecmwfintdatasets 8 httpswwwncdcnoaagovdata-accessmodel-datamodel-datasetsglobal-forcast-system-gfs
D54 ndash v 100
Page
29
The WRF downscaling that takes as input a low resolution weather and creates a high
resolution weather
The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model
computes dispersion patterns given predominant weather conditions
The following outputs are made available for visualization or further processing
The dispersions produced by DIPCOT
The weather clusters produced by the weather clustering algorithm
62 Requirements
Table 9 lists the ingestion storage processing and output requirements set by this pilot
Table 9 Requirements of Second SC5 Pilot
Requirement Comment
R1 Provide a means of downloading
currentevaluation weather from
ECMWF or alternative services
Data connectorinterface needs to be developed
R2 ECMWF and NOAA datasets are
compatible with the WRF and
DIPCOT naming conventions
A preprocessing WPS normalization step will
perform the necessary transformations and
variable renamings needs to ensure compatibility
R3 Retrieve NetCDF files from HDFS
as input to the weather clustering
algorithm
R4 Dispersion matching will filter on
dispersion values
Relational database will provide indexes on
dispersion values for efficient dispersion search
R5 Dispersion visualization Weather and dispersion matching must produce
output compatible with Sextantrsquos input or Sextant
must be modified to support new input
Table 9 Requirements of the Second SC5 Pilot
D54 ndash v 100
Page
30
Figure 5 Architecture of the Second SC5 Pilot
Figure 5 Architecture of the Second SC5 Pilot
63 Architecture
To satisfy the requirements described above the following components will be deployed
Storage infrastructure
HDFS for storing NetCDF and GRIB files
Postgres for storing dispersions
Processing components
Scilearn-kit or TensorFlow to host the weather clustering algorithm
Other modules
ECMWF and NOAA data connectors
WPS normalization procedure
WRF downscaling component
DIPCOT atmospheric dispersion model
Weather and dispersion matching
Sextant for visualizing the dispersion layer
D54 ndash v 100
Page
31
64 Deployment
Table 10 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 10 Components needed to deploy the Second SC5 Pilot
Module Task Responsible
HDFS Sextant Postgres BDI dockers made available by WP4 TF UoA NCSR-D
Scikit-learn TensorFlow To be developed in the pilot NCSR-D
DIPCOT To be packaged in the pilot NCSR-D
Weather clustering algorithm To be developed in the pilot NCSR-D
Weather matching To be developed in the pilot NCSR-D
Dispersion matching To be developed in the pilot NCSR-D
ECMWF and NOAA data
connector
To be developed in the pilot NCSR-D
Data visualization UI To be developed in the pilot NCSR-D
Table 10 Components needed to deploy the Second SC5 Pilot
D54 ndash v 100
Page
32
7 Second SC6 Pilot Deployment
71 Use cases
The pilot is carried out by NCSR-D and SWC in the frame of SC6 Europe in a changing world
- inclusive innovative and reflective societies
The pilot demonstrates the following workflow Municipality economic data (ie budget and
budget execution data) are ingested at a regular basis (daily weekly and so on) from a series
of locations in a variety of structures and formats are homogenized so that they can be
compared analyzed and visualized in a comprehensible way The data is exposed to users
via a dashboard that exposes searchdiscovery aggregation analysis correlation and
visualization functionalities over structured data The results of the data analysis will be stored
in the infrastructure to avoid carrying out the same processing multiple times
The second cycle of the pilot will extend the first pilot by incorporating different formats by
developing a modular parsing library
The following datasets are involved
Budget execution data of Municipality of Athens
Budget execution data of Municipality of Thessaloniki
Budget execution data of Municipality of Barcelona
The current datasets involved are exposed either as an API or as CSV XML files
Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies
Statistical data will be described in the RDF DataCube12 vocabulary
The following processing is carried out
Data ingestion and homogenization
Aggregation analysis correlation over scientific data
The following outputs are made available for visualization or further processing
Structured information extracted from budget datasets exposed as a SPARQL endpoint
Metadata for dataset searching and discovery
9 Cf httpsjoinupeceuropaeuassetdcat_application_profiledescription 10 Cf httpwwwomgorgspecEDMC-FIBOFND10Beta1indexhtm 11 Cf httpwwwomgorghot-topicsfinancehtm 12 Cf httpswwww3orgTR2014REC-vocab-data-cube-20140116
D54 ndash v 100
Page
33
Aggregation and analysis
72 Requirements
Table 11 lists the ingestion storage processing and output requirements set by this pilot
Table 11 Requirements of the Second SC6 Pilot
Requirement Comment
R1 In case of failures during data
processing users should be able to
recover data produced earlier in the
workflow along with their lineage and
associated metadata
Processing modules should periodically
store intermediate results When starting
up processing modules should check at the
metadata registry if intermediate results are
available
R2 Transform budget data into a
homogenized format using various
parsers
Parsers will be developed for the pilot
taking into account R1
R3 Expose data and metadata through a
SPARQL endpoint
The triple store should be accessed via a
SPARQL endpoint
R4 Intuitive easy-to-use interface for
searching and selecting relevant data
sources The use of the user interface
should be documented so that users
can ease into using it with as little
effort as possible
The GraphSearch UI will be used to create
visualizations from SPARQL queries
Table 11 Requirements of the Second SC6 Pilot
D54 ndash v 100
Page
34
Figure 6 Architecture of the Second SC6 Pilot
Figure 6 Architecture of the Second SC6 Pilot
73 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS for storing ingested datasets
4store for storing homogenized statistical data and dataset metadata
Processing infrastructures
Metadata extraction Spark is used to extract RDF data and metadata from budget
data These tools will react on Kafka messages
PoolParty A SKOS Thesaurus13 will be used to consolidatetranslate (linkmap) the
terms in the ingested documents (eg bio terms locations and other named entities)
For this step the SWC PoolParty Semantic Suite14 will be used as an external service
13 Please cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 14 Please cf httpwwwpoolpartybiz
D54 ndash v 100
Page
35
PoolParty is accessible from the BDE components via an HTTP API The connection
between Spark and PoolParty has been implemented in the first pilot cycle Additional
enrichment of the dataset will be explored eg via linking to DBpedia or other LOD
sources
Data analysis that will be performed on demand by pre-defined queries in the
dashboard
Other modules
Flume for dataset ingestion For every source that will be ingested into the system there
will be a flume agent responsible for data ingestion and basic modificationunification
Kafka as soon as a new record is available a Kafka message will be produced One
kafka consumer stores raw data into HDFS
A set of pre-defined SPARQL queries that carry out analytical aggregations important
comparisons and or other analysis of the data
GUI that provide functionality for (a) metadata searching to discover datasets data and
publications (b) linked data browsing (ie dereferencing entity descriptions in RDF) in
the form of a visual dashboard realised in d3js15
GraphSearch as the user interface
74 Deployment
Table 12 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 12 Components needed to deploy the Second SC6 Pilot
Module Task Responsible
Spark over HDFS 4store
Flume Kafka
BDI dockers made available by WP4 FH TF InfAI
NCSR-D SWC
Data storage schema To be extended for the pilot SWC
Metadata extraction Parsers for different data sources will
be developed for the pilot
SWC
15 Cf httpsd3jsorg
D54 ndash v 100
Page
36
GraphSearch GUI To be configured for the pilot SWC
Table 12 Components needed to deploy the Second SC6 Pilot
8 Second SC7 Pilot Deployment
81 Use cases
The pilot is carried out by SatCen UoA and NCSR-D in the frame of SC7 Secure societies ndash
Protecting freedom and security of Europe and its citizens
The pilot demonstrates the following workflows
1 Event detection workflow News sites and social media are monitored and processed
in order to extract and localize information about events Events are categorized and
the information from them is extracted the end-user is notified about the area interested
by the news and can visualize the events information together with the changes
detected by the other workflow (if activated)
2 Change detection workflow The end user selects a relevant Area of Interest With
respect to the selected dates two satellite images (earliest and latest) of these areas
are downloaded from ESA Sentinels Scientific Data Hub and processed in order to
detect changes The end-user is notified about detected changes and can view the
images and event information about this area
The second cycle of the SC7 pilot will extend the functionality and improve the performance of
the first cycle of the pilot (cf D52 Section 8)
Apart from the datasets used in the first cycle of the pilot this cycle will also use the keyword-
based Twitter API to retrieve tweets based on pre-defined keywords To further support the
keyword-based search the second cycle of the pilot will also include a full-text indexing engine
The following outputs are made available for visualization or further processing
Relevant news related to specific keywords together with the corresponding Area of
Interested
Detected changes
Moreover the event detection workflow will be extended in order to automatically activate the
change detection workflow These changes are depicted in the updated architecture diagram
in Figure 7
D54 ndash v 100
Page
37
82 Requirements
Table 13 lists the ingestion storage processing and output requirements set by the second
cycle of the pilot Since the present pilot cycle is an extension of the first pilot the requirements
of the first pilot also apply Table 13 lists only the new requirements
Table 13 Requirements of the Second SC7 Pilot
Requirement Comment
R1 Monitor keyword-based text services
(Twitter) Text is retrieved and stored
together with provenance and any
metadata provided by the service
(notably location)
The NOMAD data connectors to Twitter
and Reuters will be adapted to access the
keyword search API of Twitter and store to
Cassandra
R2 Regularly execute event detection
using Spark over the most recent text
batch
Event detection is part of the ingestion
process and adds annotations to the text
data not part of the distributed processing
R3 Improve the speed of the change
detection workflow
Optimize the scalability of the operators
developed in Apache Spark for the change
detection workflow
R4 Extend change detection workflow to
improve accuracy
Fundamental SNAP operators (eg Subset
and Terrain Correction) for Sentinel 1 will be
adapted to Apache Spark
R5 Areas of Interest are automatically
defined by event detection
The Sentinel data connector is
parametrized from the event detection
module with a GIS shape
R6 End-user interface is based on Sextant Improvement of Sextant functionalities to
improve the user experience
D54 ndash v 100
Page
38
R7 Users must be authenticated and
authorized to access the pilot data
Sextant will be extended in order to support
authentication and authorization
Table 13 Requirements of the Second SC7 Pilot
Figure 7 Architecture of the Second SC7 Pilot
Figure 7 Architecture of the Second SC7 Pilot
83 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS for storing satellite images
Cassandra for storing news and tweets content and metadata
Lucene for storing GADM dataset ie the administrative areas together with their geo-
locations
D54 ndash v 100
Page
39
Strabon for storing geo-locations of detected changes and location metadata about
news and tweets
Processing infrastructures
Spark will be made available for improving the change detection module and
developing the event detection module
Data integration
Semagrow will federate Strabon and Cassandra to provide the user interface with
homogeneous access to both data stores
Other modules
Twitter data connector
Reuters RSS feed reader
The Sentinel Data Aggregator receives as input the set of areas of interest and submits
a suitable query to the Sentinels Scientific Data Hub
Sextant as the user interface
84 Deployment
Table 14 lists the components provided to the pilot as part of BDI16 and components that will
be developed within WP6 in the context of executing the pilot
Table 14 Components needed to deploy the Second SC7 Pilot
Module Task Responsible
Big Data Integrator
HDFSHadoop Cassandra
Spark Semagrow Strabon
SOLR
BDI dockers made available by WP4 FH TF InfAI
NCSR-D UoA
SwC
Cassandra and Strabon
stores
The schema needs to be altered to
support tweets by keyword
NCSR-D and
UoA
Change detection module Spark code to be developed for UoA
16 Cf httpsgithubcombig-data-europeREADMEwikiComponents
D54 ndash v 100
Page
40
extending and improving the change
detection algorithm
Event Detection module Spark code to be developed to scale
the event detection algorithm
NCSR-D
Twitter data connector To be extended to access the keyword
search Twitter API
NCSR-D
User interface To be enhanced for the pilot UoA
Table 14 Components needed to deploy the Second SC7 Pilot
D54 ndash v 100
Page
41
9 Conclusions This report analysed the pilot requirements and specifies the components of the the generic
Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting
round The relevant work in this task is to ensure that the components are within the scope
of what is prepared in WP4 and that they interoperate and can be used in the same
application
All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure
and provided to the piloting partners as a basis for their piloting applications which will be
developed in WP6 As a result of this preliminary testing and the interaction between the
technical partners and the piloting partners some of the original pilot descriptions have
been refined and fully specified and their usage of BDI components has been clarified This
ensures that the pilot descriptions are consistent with the first public release of the BDI
platform (D42) and can be reproduced by interested third parties
Work in this task (Task 52) will proceed as follows
During the second pilot deployment phase work in this task will follow and document
development of the individual components and test their integration into the platform
During the third pilot deployment phase work in this task will prepare the next version
of this document regarding the BDI instances needed for the third piloting round
D54 ndash v 100
Page
29
The WRF downscaling that takes as input a low resolution weather and creates a high
resolution weather
The DIPCOT (DIsPersion over COmplex Terrain) atmospheric dispersion model
computes dispersion patterns given predominant weather conditions
The following outputs are made available for visualization or further processing
The dispersions produced by DIPCOT
The weather clusters produced by the weather clustering algorithm
62 Requirements
Table 9 lists the ingestion storage processing and output requirements set by this pilot
Table 9 Requirements of Second SC5 Pilot
Requirement Comment
R1 Provide a means of downloading
currentevaluation weather from
ECMWF or alternative services
Data connectorinterface needs to be developed
R2 ECMWF and NOAA datasets are
compatible with the WRF and
DIPCOT naming conventions
A preprocessing WPS normalization step will
perform the necessary transformations and
variable renamings needs to ensure compatibility
R3 Retrieve NetCDF files from HDFS
as input to the weather clustering
algorithm
R4 Dispersion matching will filter on
dispersion values
Relational database will provide indexes on
dispersion values for efficient dispersion search
R5 Dispersion visualization Weather and dispersion matching must produce
output compatible with Sextantrsquos input or Sextant
must be modified to support new input
Table 9 Requirements of the Second SC5 Pilot
D54 ndash v 100
Page
30
Figure 5 Architecture of the Second SC5 Pilot
Figure 5 Architecture of the Second SC5 Pilot
63 Architecture
To satisfy the requirements described above the following components will be deployed
Storage infrastructure
HDFS for storing NetCDF and GRIB files
Postgres for storing dispersions
Processing components
Scilearn-kit or TensorFlow to host the weather clustering algorithm
Other modules
ECMWF and NOAA data connectors
WPS normalization procedure
WRF downscaling component
DIPCOT atmospheric dispersion model
Weather and dispersion matching
Sextant for visualizing the dispersion layer
D54 ndash v 100
Page
31
64 Deployment
Table 10 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 10 Components needed to deploy the Second SC5 Pilot
Module Task Responsible
HDFS Sextant Postgres BDI dockers made available by WP4 TF UoA NCSR-D
Scikit-learn TensorFlow To be developed in the pilot NCSR-D
DIPCOT To be packaged in the pilot NCSR-D
Weather clustering algorithm To be developed in the pilot NCSR-D
Weather matching To be developed in the pilot NCSR-D
Dispersion matching To be developed in the pilot NCSR-D
ECMWF and NOAA data
connector
To be developed in the pilot NCSR-D
Data visualization UI To be developed in the pilot NCSR-D
Table 10 Components needed to deploy the Second SC5 Pilot
D54 ndash v 100
Page
32
7 Second SC6 Pilot Deployment
71 Use cases
The pilot is carried out by NCSR-D and SWC in the frame of SC6 Europe in a changing world
- inclusive innovative and reflective societies
The pilot demonstrates the following workflow Municipality economic data (ie budget and
budget execution data) are ingested at a regular basis (daily weekly and so on) from a series
of locations in a variety of structures and formats are homogenized so that they can be
compared analyzed and visualized in a comprehensible way The data is exposed to users
via a dashboard that exposes searchdiscovery aggregation analysis correlation and
visualization functionalities over structured data The results of the data analysis will be stored
in the infrastructure to avoid carrying out the same processing multiple times
The second cycle of the pilot will extend the first pilot by incorporating different formats by
developing a modular parsing library
The following datasets are involved
Budget execution data of Municipality of Athens
Budget execution data of Municipality of Thessaloniki
Budget execution data of Municipality of Barcelona
The current datasets involved are exposed either as an API or as CSV XML files
Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies
Statistical data will be described in the RDF DataCube12 vocabulary
The following processing is carried out
Data ingestion and homogenization
Aggregation analysis correlation over scientific data
The following outputs are made available for visualization or further processing
Structured information extracted from budget datasets exposed as a SPARQL endpoint
Metadata for dataset searching and discovery
9 Cf httpsjoinupeceuropaeuassetdcat_application_profiledescription 10 Cf httpwwwomgorgspecEDMC-FIBOFND10Beta1indexhtm 11 Cf httpwwwomgorghot-topicsfinancehtm 12 Cf httpswwww3orgTR2014REC-vocab-data-cube-20140116
D54 ndash v 100
Page
33
Aggregation and analysis
72 Requirements
Table 11 lists the ingestion storage processing and output requirements set by this pilot
Table 11 Requirements of the Second SC6 Pilot
Requirement Comment
R1 In case of failures during data
processing users should be able to
recover data produced earlier in the
workflow along with their lineage and
associated metadata
Processing modules should periodically
store intermediate results When starting
up processing modules should check at the
metadata registry if intermediate results are
available
R2 Transform budget data into a
homogenized format using various
parsers
Parsers will be developed for the pilot
taking into account R1
R3 Expose data and metadata through a
SPARQL endpoint
The triple store should be accessed via a
SPARQL endpoint
R4 Intuitive easy-to-use interface for
searching and selecting relevant data
sources The use of the user interface
should be documented so that users
can ease into using it with as little
effort as possible
The GraphSearch UI will be used to create
visualizations from SPARQL queries
Table 11 Requirements of the Second SC6 Pilot
D54 ndash v 100
Page
34
Figure 6 Architecture of the Second SC6 Pilot
Figure 6 Architecture of the Second SC6 Pilot
73 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS for storing ingested datasets
4store for storing homogenized statistical data and dataset metadata
Processing infrastructures
Metadata extraction Spark is used to extract RDF data and metadata from budget
data These tools will react on Kafka messages
PoolParty A SKOS Thesaurus13 will be used to consolidatetranslate (linkmap) the
terms in the ingested documents (eg bio terms locations and other named entities)
For this step the SWC PoolParty Semantic Suite14 will be used as an external service
13 Please cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 14 Please cf httpwwwpoolpartybiz
D54 ndash v 100
Page
35
PoolParty is accessible from the BDE components via an HTTP API The connection
between Spark and PoolParty has been implemented in the first pilot cycle Additional
enrichment of the dataset will be explored eg via linking to DBpedia or other LOD
sources
Data analysis that will be performed on demand by pre-defined queries in the
dashboard
Other modules
Flume for dataset ingestion For every source that will be ingested into the system there
will be a flume agent responsible for data ingestion and basic modificationunification
Kafka as soon as a new record is available a Kafka message will be produced One
kafka consumer stores raw data into HDFS
A set of pre-defined SPARQL queries that carry out analytical aggregations important
comparisons and or other analysis of the data
GUI that provide functionality for (a) metadata searching to discover datasets data and
publications (b) linked data browsing (ie dereferencing entity descriptions in RDF) in
the form of a visual dashboard realised in d3js15
GraphSearch as the user interface
74 Deployment
Table 12 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 12 Components needed to deploy the Second SC6 Pilot
Module Task Responsible
Spark over HDFS 4store
Flume Kafka
BDI dockers made available by WP4 FH TF InfAI
NCSR-D SWC
Data storage schema To be extended for the pilot SWC
Metadata extraction Parsers for different data sources will
be developed for the pilot
SWC
15 Cf httpsd3jsorg
D54 ndash v 100
Page
36
GraphSearch GUI To be configured for the pilot SWC
Table 12 Components needed to deploy the Second SC6 Pilot
8 Second SC7 Pilot Deployment

8.1 Use cases

The pilot is carried out by SatCen, UoA, and NCSR-D in the frame of SC7, Secure societies – Protecting freedom and security of Europe and its citizens.

The pilot demonstrates the following workflows:
1. Event detection workflow: news sites and social media are monitored and processed in order to extract and localize information about events. Events are categorized and the information in them is extracted; the end-user is notified about the area concerned by the news and can visualize the event information together with the changes detected by the other workflow (if activated).
2. Change detection workflow: the end user selects a relevant Area of Interest. For the selected dates, two satellite images (earliest and latest) of this area are downloaded from the ESA Sentinels Scientific Data Hub and processed in order to detect changes. The end-user is notified about detected changes and can view the images and event information about this area.

The second cycle of the SC7 pilot will extend the functionality and improve the performance of the first cycle of the pilot (cf. D5.2, Section 8).

Apart from the datasets used in the first cycle of the pilot, this cycle will also use the keyword-based Twitter API to retrieve tweets matching pre-defined keywords. To further support the keyword-based search, the second cycle of the pilot will also include a full-text indexing engine.
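As a sketch of how the full-text indexing engine could support the keyword-based search, the snippet below indexes an article and queries it by keyword through SOLR (the Lucene-based engine listed in Table 14). It assumes the pysolr client; the core name and field names are illustrative.

import pysolr  # pip install pysolr

solr = pysolr.Solr("http://solr:8983/solr/news", always_commit=True)  # placeholder core

# Index a retrieved article together with its provenance metadata.
solr.add([{
    "id": "reuters-001",
    "body": "Flooding reported near the river delta ...",
    "source": "reuters",
    "published": "2017-01-15T10:00:00Z",
}])

# Keyword search supporting the monitoring workflow.
for doc in solr.search("body:flooding"):
    print(doc["id"], doc["source"])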
The following outputs are made available for visualization or further processing:
- Relevant news related to specific keywords, together with the corresponding Area of Interest.
- Detected changes.

Moreover, the event detection workflow will be extended in order to automatically activate the change detection workflow. These changes are depicted in the updated architecture diagram in Figure 7.
8.2 Requirements

Table 13 lists the ingestion, storage, processing, and output requirements set by the second cycle of the pilot. Since the present pilot cycle is an extension of the first pilot, the requirements of the first pilot also apply; Table 13 lists only the new requirements.
Table 13: Requirements of the Second SC7 Pilot

Requirement | Comment
R1. Monitor keyword-based text services (Twitter). Text is retrieved and stored together with provenance and any metadata provided by the service (notably location). | The NOMAD data connectors to Twitter and Reuters will be adapted to access the keyword-search API of Twitter and to store the results to Cassandra.
R2. Regularly execute event detection using Spark over the most recent text batch. | Event detection is part of the ingestion process and adds annotations to the text data; it is not part of the distributed processing.
R3. Improve the speed of the change detection workflow. | Optimize the scalability of the operators developed in Apache Spark for the change detection workflow.
R4. Extend the change detection workflow to improve accuracy. | Fundamental SNAP operators (e.g., Subset and Terrain Correction) for Sentinel-1 will be adapted to Apache Spark.
R5. Areas of Interest are automatically defined by event detection. | The Sentinel data connector is parametrized by the event detection module with a GIS shape.
R6. The end-user interface is based on Sextant. | Sextant functionalities will be improved to enhance the user experience.
R7. Users must be authenticated and authorized to access the pilot data. | Sextant will be extended in order to support authentication and authorization.
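To make R1 and the corresponding schema change of Table 14 concrete, the sketch below stores keyword-matched tweets in Cassandra with the keyword as partition key, so that tweets can be retrieved per keyword. It assumes the cassandra-driver library and an existing keyspace; the keyspace, table, and column names are illustrative, not the pilot's actual schema.

from cassandra.cluster import Cluster  # pip install cassandra-driver

session = Cluster(["cassandra"]).connect()  # placeholder contact point
session.execute("""
    CREATE TABLE IF NOT EXISTS pilot.tweets_by_keyword (
        keyword text, tweeted_at timestamp, tweet_id bigint,
        body text, location text,
        PRIMARY KEY ((keyword), tweeted_at, tweet_id))""")  # assumes keyspace 'pilot'

insert = session.prepare(
    "INSERT INTO pilot.tweets_by_keyword "
    "(keyword, tweeted_at, tweet_id, body, location) VALUES (?, ?, ?, ?, ?)")

def store_tweet(keyword, tweet):
    # 'tweet' is whatever the Twitter connector yields: id, datetime, text, geo metadata.
    session.execute(insert, (keyword, tweet["created_at"], tweet["id"],
                             tweet["text"], tweet.get("location", "")))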
Figure 7: Architecture of the Second SC7 Pilot
8.3 Architecture

To satisfy the requirements above, the following modules will be deployed.

Storage infrastructures:
- HDFS, for storing satellite images.
- Cassandra, for storing news and tweet content and metadata.
- Lucene, for storing the GADM dataset, i.e., the administrative areas together with their geo-locations.
- Strabon, for storing geo-locations of detected changes and location metadata about news and tweets.

Processing infrastructures:
- Spark will be made available for improving the change detection module and for developing the event detection module.

Data integration:
- Semagrow will federate Strabon and Cassandra to provide the user interface with homogeneous access to both data stores.

Other modules:
- Twitter data connector.
- Reuters RSS feed reader.
- The Sentinel Data Aggregator, which receives as input the set of Areas of Interest and submits a suitable query to the Sentinels Scientific Data Hub (see the sketch after this list).
- Sextant as the user interface.
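A minimal sketch of the Sentinel Data Aggregator step, assuming the sentinelsat client for the Sentinels Scientific Data Hub: the Area of Interest polygon would come from the event detection module (R5), and the credentials, coordinates, and dates are placeholders.

from sentinelsat import SentinelAPI  # pip install sentinelsat

api = SentinelAPI("user", "password", "https://scihub.copernicus.eu/dhus")
aoi = "POLYGON((23.6 37.9, 23.8 37.9, 23.8 38.1, 23.6 38.1, 23.6 37.9))"  # from event detection

products = api.query(aoi, date=("20170101", "20170201"), platformname="Sentinel-1")

# Download the earliest and the latest acquisition for change detection.
ordered = sorted(products.items(), key=lambda kv: kv[1]["beginposition"])
for product_id, _ in (ordered[0], ordered[-1]):
    api.download(product_id)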
8.4 Deployment

Table 14 lists the components provided to the pilot as part of BDI [16] and the components that will be developed within WP6 in the context of executing the pilot.
Table 14: Components needed to deploy the Second SC7 Pilot

Module | Task | Responsible
Big Data Integrator (HDFS/Hadoop, Cassandra, Spark, Semagrow, Strabon, SOLR) | BDI dockers made available by WP4 | FH, TF, InfAI, NCSR-D, UoA, SWC
Cassandra and Strabon stores | The schema needs to be altered to support tweets by keyword | NCSR-D and UoA
Change detection module | Spark code to be developed for extending and improving the change detection algorithm | UoA
Event detection module | Spark code to be developed to scale the event detection algorithm | NCSR-D
Twitter data connector | To be extended to access the keyword-search Twitter API | NCSR-D
User interface | To be enhanced for the pilot | UoA

[16] Cf. https://github.com/big-data-europe/README/wiki/Components
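The change detection row above can be pictured with a hedged Spark sketch: tiles of the earliest/latest image pair are distributed across executors, each applying the SNAP-derived preprocessing and differencing locally. The detect_changes function stands in for the pilot's actual operators and is purely illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("change-detection").getOrCreate()

# Placeholder tile paths for the earliest/latest Sentinel-1 acquisitions.
pairs = [("/data/early_tile_%02d.tif" % i, "/data/late_tile_%02d.tif" % i)
         for i in range(16)]

def detect_changes(pair):
    early, late = pair
    # Stand-in for the real pipeline: subset, terrain-correct, co-register, difference.
    return (early, late, "/data/change_mask_for_" + early.split("/")[-1])

results = spark.sparkContext.parallelize(pairs, 16).map(detect_changes).collect()
spark.stop()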
9 Conclusions

This report analysed the pilot requirements and specified the components of the generic Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting round. The relevant work in this task is to ensure that the components are within the scope of what is prepared in WP4, and that they interoperate and can be used in the same application.

All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure and provided to the piloting partners as a basis for their piloting applications, which will be developed in WP6. As a result of this preliminary testing and of the interaction between the technical partners and the piloting partners, some of the original pilot descriptions have been refined and fully specified, and their usage of BDI components has been clarified. This ensures that the pilot descriptions are consistent with the first public release of the BDI platform (D4.2) and can be reproduced by interested third parties.

Work in this task (Task 5.2) will proceed as follows:
- During the second pilot deployment phase, work in this task will follow and document the development of the individual components and test their integration into the platform.
- During the third pilot deployment phase, work in this task will prepare the next version of this document, regarding the BDI instances needed for the third piloting round.
D54 ndash v 100
Page
30
Figure 5 Architecture of the Second SC5 Pilot
Figure 5 Architecture of the Second SC5 Pilot
63 Architecture
To satisfy the requirements described above the following components will be deployed
Storage infrastructure
HDFS for storing NetCDF and GRIB files
Postgres for storing dispersions
Processing components
Scilearn-kit or TensorFlow to host the weather clustering algorithm
Other modules
ECMWF and NOAA data connectors
WPS normalization procedure
WRF downscaling component
DIPCOT atmospheric dispersion model
Weather and dispersion matching
Sextant for visualizing the dispersion layer
D54 ndash v 100
Page
31
64 Deployment
Table 10 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 10 Components needed to deploy the Second SC5 Pilot
Module Task Responsible
HDFS Sextant Postgres BDI dockers made available by WP4 TF UoA NCSR-D
Scikit-learn TensorFlow To be developed in the pilot NCSR-D
DIPCOT To be packaged in the pilot NCSR-D
Weather clustering algorithm To be developed in the pilot NCSR-D
Weather matching To be developed in the pilot NCSR-D
Dispersion matching To be developed in the pilot NCSR-D
ECMWF and NOAA data
connector
To be developed in the pilot NCSR-D
Data visualization UI To be developed in the pilot NCSR-D
Table 10 Components needed to deploy the Second SC5 Pilot
D54 ndash v 100
Page
32
7 Second SC6 Pilot Deployment
71 Use cases
The pilot is carried out by NCSR-D and SWC in the frame of SC6 Europe in a changing world
- inclusive innovative and reflective societies
The pilot demonstrates the following workflow Municipality economic data (ie budget and
budget execution data) are ingested at a regular basis (daily weekly and so on) from a series
of locations in a variety of structures and formats are homogenized so that they can be
compared analyzed and visualized in a comprehensible way The data is exposed to users
via a dashboard that exposes searchdiscovery aggregation analysis correlation and
visualization functionalities over structured data The results of the data analysis will be stored
in the infrastructure to avoid carrying out the same processing multiple times
The second cycle of the pilot will extend the first pilot by incorporating different formats by
developing a modular parsing library
The following datasets are involved
Budget execution data of Municipality of Athens
Budget execution data of Municipality of Thessaloniki
Budget execution data of Municipality of Barcelona
The current datasets involved are exposed either as an API or as CSV XML files
Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies
Statistical data will be described in the RDF DataCube12 vocabulary
The following processing is carried out
Data ingestion and homogenization
Aggregation analysis correlation over scientific data
The following outputs are made available for visualization or further processing
Structured information extracted from budget datasets exposed as a SPARQL endpoint
Metadata for dataset searching and discovery
9 Cf httpsjoinupeceuropaeuassetdcat_application_profiledescription 10 Cf httpwwwomgorgspecEDMC-FIBOFND10Beta1indexhtm 11 Cf httpwwwomgorghot-topicsfinancehtm 12 Cf httpswwww3orgTR2014REC-vocab-data-cube-20140116
D54 ndash v 100
Page
33
Aggregation and analysis
72 Requirements
Table 11 lists the ingestion storage processing and output requirements set by this pilot
Table 11 Requirements of the Second SC6 Pilot
Requirement Comment
R1 In case of failures during data
processing users should be able to
recover data produced earlier in the
workflow along with their lineage and
associated metadata
Processing modules should periodically
store intermediate results When starting
up processing modules should check at the
metadata registry if intermediate results are
available
R2 Transform budget data into a
homogenized format using various
parsers
Parsers will be developed for the pilot
taking into account R1
R3 Expose data and metadata through a
SPARQL endpoint
The triple store should be accessed via a
SPARQL endpoint
R4 Intuitive easy-to-use interface for
searching and selecting relevant data
sources The use of the user interface
should be documented so that users
can ease into using it with as little
effort as possible
The GraphSearch UI will be used to create
visualizations from SPARQL queries
Table 11 Requirements of the Second SC6 Pilot
D54 ndash v 100
Page
34
Figure 6 Architecture of the Second SC6 Pilot
Figure 6 Architecture of the Second SC6 Pilot
73 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS for storing ingested datasets
4store for storing homogenized statistical data and dataset metadata
Processing infrastructures
Metadata extraction Spark is used to extract RDF data and metadata from budget
data These tools will react on Kafka messages
PoolParty A SKOS Thesaurus13 will be used to consolidatetranslate (linkmap) the
terms in the ingested documents (eg bio terms locations and other named entities)
For this step the SWC PoolParty Semantic Suite14 will be used as an external service
13 Please cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 14 Please cf httpwwwpoolpartybiz
D54 ndash v 100
Page
35
PoolParty is accessible from the BDE components via an HTTP API The connection
between Spark and PoolParty has been implemented in the first pilot cycle Additional
enrichment of the dataset will be explored eg via linking to DBpedia or other LOD
sources
Data analysis that will be performed on demand by pre-defined queries in the
dashboard
Other modules
Flume for dataset ingestion For every source that will be ingested into the system there
will be a flume agent responsible for data ingestion and basic modificationunification
Kafka as soon as a new record is available a Kafka message will be produced One
kafka consumer stores raw data into HDFS
A set of pre-defined SPARQL queries that carry out analytical aggregations important
comparisons and or other analysis of the data
GUI that provide functionality for (a) metadata searching to discover datasets data and
publications (b) linked data browsing (ie dereferencing entity descriptions in RDF) in
the form of a visual dashboard realised in d3js15
GraphSearch as the user interface
74 Deployment
Table 12 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 12 Components needed to deploy the Second SC6 Pilot
Module Task Responsible
Spark over HDFS 4store
Flume Kafka
BDI dockers made available by WP4 FH TF InfAI
NCSR-D SWC
Data storage schema To be extended for the pilot SWC
Metadata extraction Parsers for different data sources will
be developed for the pilot
SWC
15 Cf httpsd3jsorg
D54 ndash v 100
Page
36
GraphSearch GUI To be configured for the pilot SWC
Table 12 Components needed to deploy the Second SC6 Pilot
8 Second SC7 Pilot Deployment
81 Use cases
The pilot is carried out by SatCen UoA and NCSR-D in the frame of SC7 Secure societies ndash
Protecting freedom and security of Europe and its citizens
The pilot demonstrates the following workflows
1 Event detection workflow News sites and social media are monitored and processed
in order to extract and localize information about events Events are categorized and
the information from them is extracted the end-user is notified about the area interested
by the news and can visualize the events information together with the changes
detected by the other workflow (if activated)
2 Change detection workflow The end user selects a relevant Area of Interest With
respect to the selected dates two satellite images (earliest and latest) of these areas
are downloaded from ESA Sentinels Scientific Data Hub and processed in order to
detect changes The end-user is notified about detected changes and can view the
images and event information about this area
The second cycle of the SC7 pilot will extend the functionality and improve the performance of
the first cycle of the pilot (cf D52 Section 8)
Apart from the datasets used in the first cycle of the pilot this cycle will also use the keyword-
based Twitter API to retrieve tweets based on pre-defined keywords To further support the
keyword-based search the second cycle of the pilot will also include a full-text indexing engine
The following outputs are made available for visualization or further processing
Relevant news related to specific keywords together with the corresponding Area of
Interested
Detected changes
Moreover the event detection workflow will be extended in order to automatically activate the
change detection workflow These changes are depicted in the updated architecture diagram
in Figure 7
D54 ndash v 100
Page
37
82 Requirements
Table 13 lists the ingestion storage processing and output requirements set by the second
cycle of the pilot Since the present pilot cycle is an extension of the first pilot the requirements
of the first pilot also apply Table 13 lists only the new requirements
Table 13 Requirements of the Second SC7 Pilot
Requirement Comment
R1 Monitor keyword-based text services
(Twitter) Text is retrieved and stored
together with provenance and any
metadata provided by the service
(notably location)
The NOMAD data connectors to Twitter
and Reuters will be adapted to access the
keyword search API of Twitter and store to
Cassandra
R2 Regularly execute event detection
using Spark over the most recent text
batch
Event detection is part of the ingestion
process and adds annotations to the text
data not part of the distributed processing
R3 Improve the speed of the change
detection workflow
Optimize the scalability of the operators
developed in Apache Spark for the change
detection workflow
R4 Extend change detection workflow to
improve accuracy
Fundamental SNAP operators (eg Subset
and Terrain Correction) for Sentinel 1 will be
adapted to Apache Spark
R5 Areas of Interest are automatically
defined by event detection
The Sentinel data connector is
parametrized from the event detection
module with a GIS shape
R6 End-user interface is based on Sextant Improvement of Sextant functionalities to
improve the user experience
D54 ndash v 100
Page
38
R7 Users must be authenticated and
authorized to access the pilot data
Sextant will be extended in order to support
authentication and authorization
Table 13 Requirements of the Second SC7 Pilot
Figure 7 Architecture of the Second SC7 Pilot
Figure 7 Architecture of the Second SC7 Pilot
83 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS for storing satellite images
Cassandra for storing news and tweets content and metadata
Lucene for storing GADM dataset ie the administrative areas together with their geo-
locations
D54 ndash v 100
Page
39
Strabon for storing geo-locations of detected changes and location metadata about
news and tweets
Processing infrastructures
Spark will be made available for improving the change detection module and
developing the event detection module
Data integration
Semagrow will federate Strabon and Cassandra to provide the user interface with
homogeneous access to both data stores
Other modules
Twitter data connector
Reuters RSS feed reader
The Sentinel Data Aggregator receives as input the set of areas of interest and submits
a suitable query to the Sentinels Scientific Data Hub
Sextant as the user interface
84 Deployment
Table 14 lists the components provided to the pilot as part of BDI16 and components that will
be developed within WP6 in the context of executing the pilot
Table 14 Components needed to deploy the Second SC7 Pilot
Module Task Responsible
Big Data Integrator
HDFSHadoop Cassandra
Spark Semagrow Strabon
SOLR
BDI dockers made available by WP4 FH TF InfAI
NCSR-D UoA
SwC
Cassandra and Strabon
stores
The schema needs to be altered to
support tweets by keyword
NCSR-D and
UoA
Change detection module Spark code to be developed for UoA
16 Cf httpsgithubcombig-data-europeREADMEwikiComponents
D54 ndash v 100
Page
40
extending and improving the change
detection algorithm
Event Detection module Spark code to be developed to scale
the event detection algorithm
NCSR-D
Twitter data connector To be extended to access the keyword
search Twitter API
NCSR-D
User interface To be enhanced for the pilot UoA
Table 14 Components needed to deploy the Second SC7 Pilot
D54 ndash v 100
Page
41
9 Conclusions This report analysed the pilot requirements and specifies the components of the the generic
Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting
round The relevant work in this task is to ensure that the components are within the scope
of what is prepared in WP4 and that they interoperate and can be used in the same
application
All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure
and provided to the piloting partners as a basis for their piloting applications which will be
developed in WP6 As a result of this preliminary testing and the interaction between the
technical partners and the piloting partners some of the original pilot descriptions have
been refined and fully specified and their usage of BDI components has been clarified This
ensures that the pilot descriptions are consistent with the first public release of the BDI
platform (D42) and can be reproduced by interested third parties
Work in this task (Task 52) will proceed as follows
During the second pilot deployment phase work in this task will follow and document
development of the individual components and test their integration into the platform
During the third pilot deployment phase work in this task will prepare the next version
of this document regarding the BDI instances needed for the third piloting round
D54 ndash v 100
Page
31
64 Deployment
Table 10 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 10 Components needed to deploy the Second SC5 Pilot
Module Task Responsible
HDFS Sextant Postgres BDI dockers made available by WP4 TF UoA NCSR-D
Scikit-learn TensorFlow To be developed in the pilot NCSR-D
DIPCOT To be packaged in the pilot NCSR-D
Weather clustering algorithm To be developed in the pilot NCSR-D
Weather matching To be developed in the pilot NCSR-D
Dispersion matching To be developed in the pilot NCSR-D
ECMWF and NOAA data
connector
To be developed in the pilot NCSR-D
Data visualization UI To be developed in the pilot NCSR-D
Table 10 Components needed to deploy the Second SC5 Pilot
D54 ndash v 100
Page
32
7 Second SC6 Pilot Deployment
71 Use cases
The pilot is carried out by NCSR-D and SWC in the frame of SC6 Europe in a changing world
- inclusive innovative and reflective societies
The pilot demonstrates the following workflow Municipality economic data (ie budget and
budget execution data) are ingested at a regular basis (daily weekly and so on) from a series
of locations in a variety of structures and formats are homogenized so that they can be
compared analyzed and visualized in a comprehensible way The data is exposed to users
via a dashboard that exposes searchdiscovery aggregation analysis correlation and
visualization functionalities over structured data The results of the data analysis will be stored
in the infrastructure to avoid carrying out the same processing multiple times
The second cycle of the pilot will extend the first pilot by incorporating different formats by
developing a modular parsing library
The following datasets are involved
Budget execution data of Municipality of Athens
Budget execution data of Municipality of Thessaloniki
Budget execution data of Municipality of Barcelona
The current datasets involved are exposed either as an API or as CSV XML files
Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies
Statistical data will be described in the RDF DataCube12 vocabulary
The following processing is carried out
Data ingestion and homogenization
Aggregation analysis correlation over scientific data
The following outputs are made available for visualization or further processing
Structured information extracted from budget datasets exposed as a SPARQL endpoint
Metadata for dataset searching and discovery
9 Cf httpsjoinupeceuropaeuassetdcat_application_profiledescription 10 Cf httpwwwomgorgspecEDMC-FIBOFND10Beta1indexhtm 11 Cf httpwwwomgorghot-topicsfinancehtm 12 Cf httpswwww3orgTR2014REC-vocab-data-cube-20140116
D54 ndash v 100
Page
33
Aggregation and analysis
72 Requirements
Table 11 lists the ingestion storage processing and output requirements set by this pilot
Table 11 Requirements of the Second SC6 Pilot
Requirement Comment
R1 In case of failures during data
processing users should be able to
recover data produced earlier in the
workflow along with their lineage and
associated metadata
Processing modules should periodically
store intermediate results When starting
up processing modules should check at the
metadata registry if intermediate results are
available
R2 Transform budget data into a
homogenized format using various
parsers
Parsers will be developed for the pilot
taking into account R1
R3 Expose data and metadata through a
SPARQL endpoint
The triple store should be accessed via a
SPARQL endpoint
R4 Intuitive easy-to-use interface for
searching and selecting relevant data
sources The use of the user interface
should be documented so that users
can ease into using it with as little
effort as possible
The GraphSearch UI will be used to create
visualizations from SPARQL queries
Table 11 Requirements of the Second SC6 Pilot
D54 ndash v 100
Page
34
Figure 6 Architecture of the Second SC6 Pilot
Figure 6 Architecture of the Second SC6 Pilot
73 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS for storing ingested datasets
4store for storing homogenized statistical data and dataset metadata
Processing infrastructures
Metadata extraction Spark is used to extract RDF data and metadata from budget
data These tools will react on Kafka messages
PoolParty A SKOS Thesaurus13 will be used to consolidatetranslate (linkmap) the
terms in the ingested documents (eg bio terms locations and other named entities)
For this step the SWC PoolParty Semantic Suite14 will be used as an external service
13 Please cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 14 Please cf httpwwwpoolpartybiz
D54 ndash v 100
Page
35
PoolParty is accessible from the BDE components via an HTTP API The connection
between Spark and PoolParty has been implemented in the first pilot cycle Additional
enrichment of the dataset will be explored eg via linking to DBpedia or other LOD
sources
Data analysis that will be performed on demand by pre-defined queries in the
dashboard
Other modules
Flume for dataset ingestion For every source that will be ingested into the system there
will be a flume agent responsible for data ingestion and basic modificationunification
Kafka as soon as a new record is available a Kafka message will be produced One
kafka consumer stores raw data into HDFS
A set of pre-defined SPARQL queries that carry out analytical aggregations important
comparisons and or other analysis of the data
GUI that provide functionality for (a) metadata searching to discover datasets data and
publications (b) linked data browsing (ie dereferencing entity descriptions in RDF) in
the form of a visual dashboard realised in d3js15
GraphSearch as the user interface
74 Deployment
Table 12 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 12 Components needed to deploy the Second SC6 Pilot
Module Task Responsible
Spark over HDFS 4store
Flume Kafka
BDI dockers made available by WP4 FH TF InfAI
NCSR-D SWC
Data storage schema To be extended for the pilot SWC
Metadata extraction Parsers for different data sources will
be developed for the pilot
SWC
15 Cf httpsd3jsorg
D54 ndash v 100
Page
36
GraphSearch GUI To be configured for the pilot SWC
Table 12 Components needed to deploy the Second SC6 Pilot
8 Second SC7 Pilot Deployment
81 Use cases
The pilot is carried out by SatCen UoA and NCSR-D in the frame of SC7 Secure societies ndash
Protecting freedom and security of Europe and its citizens
The pilot demonstrates the following workflows
1 Event detection workflow News sites and social media are monitored and processed
in order to extract and localize information about events Events are categorized and
the information from them is extracted the end-user is notified about the area interested
by the news and can visualize the events information together with the changes
detected by the other workflow (if activated)
2 Change detection workflow The end user selects a relevant Area of Interest With
respect to the selected dates two satellite images (earliest and latest) of these areas
are downloaded from ESA Sentinels Scientific Data Hub and processed in order to
detect changes The end-user is notified about detected changes and can view the
images and event information about this area
The second cycle of the SC7 pilot will extend the functionality and improve the performance of
the first cycle of the pilot (cf D52 Section 8)
Apart from the datasets used in the first cycle of the pilot this cycle will also use the keyword-
based Twitter API to retrieve tweets based on pre-defined keywords To further support the
keyword-based search the second cycle of the pilot will also include a full-text indexing engine
The following outputs are made available for visualization or further processing
Relevant news related to specific keywords together with the corresponding Area of
Interested
Detected changes
Moreover the event detection workflow will be extended in order to automatically activate the
change detection workflow These changes are depicted in the updated architecture diagram
in Figure 7
D54 ndash v 100
Page
37
82 Requirements
Table 13 lists the ingestion storage processing and output requirements set by the second
cycle of the pilot Since the present pilot cycle is an extension of the first pilot the requirements
of the first pilot also apply Table 13 lists only the new requirements
Table 13 Requirements of the Second SC7 Pilot
Requirement Comment
R1 Monitor keyword-based text services
(Twitter) Text is retrieved and stored
together with provenance and any
metadata provided by the service
(notably location)
The NOMAD data connectors to Twitter
and Reuters will be adapted to access the
keyword search API of Twitter and store to
Cassandra
R2 Regularly execute event detection
using Spark over the most recent text
batch
Event detection is part of the ingestion
process and adds annotations to the text
data not part of the distributed processing
R3 Improve the speed of the change
detection workflow
Optimize the scalability of the operators
developed in Apache Spark for the change
detection workflow
R4 Extend change detection workflow to
improve accuracy
Fundamental SNAP operators (eg Subset
and Terrain Correction) for Sentinel 1 will be
adapted to Apache Spark
R5 Areas of Interest are automatically
defined by event detection
The Sentinel data connector is
parametrized from the event detection
module with a GIS shape
R6 End-user interface is based on Sextant Improvement of Sextant functionalities to
improve the user experience
D54 ndash v 100
Page
38
R7 Users must be authenticated and
authorized to access the pilot data
Sextant will be extended in order to support
authentication and authorization
Table 13 Requirements of the Second SC7 Pilot
Figure 7 Architecture of the Second SC7 Pilot
Figure 7 Architecture of the Second SC7 Pilot
83 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS for storing satellite images
Cassandra for storing news and tweets content and metadata
Lucene for storing GADM dataset ie the administrative areas together with their geo-
locations
D54 ndash v 100
Page
39
Strabon for storing geo-locations of detected changes and location metadata about
news and tweets
Processing infrastructures
Spark will be made available for improving the change detection module and
developing the event detection module
Data integration
Semagrow will federate Strabon and Cassandra to provide the user interface with
homogeneous access to both data stores
Other modules
Twitter data connector
Reuters RSS feed reader
The Sentinel Data Aggregator receives as input the set of areas of interest and submits
a suitable query to the Sentinels Scientific Data Hub
Sextant as the user interface
84 Deployment
Table 14 lists the components provided to the pilot as part of BDI16 and components that will
be developed within WP6 in the context of executing the pilot
Table 14 Components needed to deploy the Second SC7 Pilot
Module Task Responsible
Big Data Integrator
HDFSHadoop Cassandra
Spark Semagrow Strabon
SOLR
BDI dockers made available by WP4 FH TF InfAI
NCSR-D UoA
SwC
Cassandra and Strabon
stores
The schema needs to be altered to
support tweets by keyword
NCSR-D and
UoA
Change detection module Spark code to be developed for UoA
16 Cf httpsgithubcombig-data-europeREADMEwikiComponents
D54 ndash v 100
Page
40
extending and improving the change
detection algorithm
Event Detection module Spark code to be developed to scale
the event detection algorithm
NCSR-D
Twitter data connector To be extended to access the keyword
search Twitter API
NCSR-D
User interface To be enhanced for the pilot UoA
Table 14 Components needed to deploy the Second SC7 Pilot
D54 ndash v 100
Page
41
9 Conclusions This report analysed the pilot requirements and specifies the components of the the generic
Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting
round The relevant work in this task is to ensure that the components are within the scope
of what is prepared in WP4 and that they interoperate and can be used in the same
application
All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure
and provided to the piloting partners as a basis for their piloting applications which will be
developed in WP6 As a result of this preliminary testing and the interaction between the
technical partners and the piloting partners some of the original pilot descriptions have
been refined and fully specified and their usage of BDI components has been clarified This
ensures that the pilot descriptions are consistent with the first public release of the BDI
platform (D42) and can be reproduced by interested third parties
Work in this task (Task 52) will proceed as follows
During the second pilot deployment phase work in this task will follow and document
development of the individual components and test their integration into the platform
During the third pilot deployment phase work in this task will prepare the next version
of this document regarding the BDI instances needed for the third piloting round
D54 ndash v 100
Page
32
7 Second SC6 Pilot Deployment
71 Use cases
The pilot is carried out by NCSR-D and SWC in the frame of SC6 Europe in a changing world
- inclusive innovative and reflective societies
The pilot demonstrates the following workflow Municipality economic data (ie budget and
budget execution data) are ingested at a regular basis (daily weekly and so on) from a series
of locations in a variety of structures and formats are homogenized so that they can be
compared analyzed and visualized in a comprehensible way The data is exposed to users
via a dashboard that exposes searchdiscovery aggregation analysis correlation and
visualization functionalities over structured data The results of the data analysis will be stored
in the infrastructure to avoid carrying out the same processing multiple times
The second cycle of the pilot will extend the first pilot by incorporating different formats by
developing a modular parsing library
The following datasets are involved
Budget execution data of Municipality of Athens
Budget execution data of Municipality of Thessaloniki
Budget execution data of Municipality of Barcelona
The current datasets involved are exposed either as an API or as CSV XML files
Datasets will be described by DCAT-AP9 metadata and the FIBO10 and FIGI11 ontologies
Statistical data will be described in the RDF DataCube12 vocabulary
The following processing is carried out
Data ingestion and homogenization
Aggregation analysis correlation over scientific data
The following outputs are made available for visualization or further processing
Structured information extracted from budget datasets exposed as a SPARQL endpoint
Metadata for dataset searching and discovery
9 Cf httpsjoinupeceuropaeuassetdcat_application_profiledescription 10 Cf httpwwwomgorgspecEDMC-FIBOFND10Beta1indexhtm 11 Cf httpwwwomgorghot-topicsfinancehtm 12 Cf httpswwww3orgTR2014REC-vocab-data-cube-20140116
D54 ndash v 100
Page
33
Aggregation and analysis
72 Requirements
Table 11 lists the ingestion storage processing and output requirements set by this pilot
Table 11 Requirements of the Second SC6 Pilot
Requirement Comment
R1 In case of failures during data
processing users should be able to
recover data produced earlier in the
workflow along with their lineage and
associated metadata
Processing modules should periodically
store intermediate results When starting
up processing modules should check at the
metadata registry if intermediate results are
available
R2 Transform budget data into a
homogenized format using various
parsers
Parsers will be developed for the pilot
taking into account R1
R3 Expose data and metadata through a
SPARQL endpoint
The triple store should be accessed via a
SPARQL endpoint
R4 Intuitive easy-to-use interface for
searching and selecting relevant data
sources The use of the user interface
should be documented so that users
can ease into using it with as little
effort as possible
The GraphSearch UI will be used to create
visualizations from SPARQL queries
Table 11 Requirements of the Second SC6 Pilot
D54 ndash v 100
Page
34
Figure 6 Architecture of the Second SC6 Pilot
Figure 6 Architecture of the Second SC6 Pilot
73 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS for storing ingested datasets
4store for storing homogenized statistical data and dataset metadata
Processing infrastructures
Metadata extraction Spark is used to extract RDF data and metadata from budget
data These tools will react on Kafka messages
PoolParty A SKOS Thesaurus13 will be used to consolidatetranslate (linkmap) the
terms in the ingested documents (eg bio terms locations and other named entities)
For this step the SWC PoolParty Semantic Suite14 will be used as an external service
13 Please cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 14 Please cf httpwwwpoolpartybiz
D54 ndash v 100
Page
35
PoolParty is accessible from the BDE components via an HTTP API The connection
between Spark and PoolParty has been implemented in the first pilot cycle Additional
enrichment of the dataset will be explored eg via linking to DBpedia or other LOD
sources
Data analysis that will be performed on demand by pre-defined queries in the
dashboard
Other modules
Flume for dataset ingestion For every source that will be ingested into the system there
will be a flume agent responsible for data ingestion and basic modificationunification
Kafka as soon as a new record is available a Kafka message will be produced One
kafka consumer stores raw data into HDFS
A set of pre-defined SPARQL queries that carry out analytical aggregations important
comparisons and or other analysis of the data
GUI that provide functionality for (a) metadata searching to discover datasets data and
publications (b) linked data browsing (ie dereferencing entity descriptions in RDF) in
the form of a visual dashboard realised in d3js15
GraphSearch as the user interface
74 Deployment
Table 12 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 12 Components needed to deploy the Second SC6 Pilot
Module Task Responsible
Spark over HDFS 4store
Flume Kafka
BDI dockers made available by WP4 FH TF InfAI
NCSR-D SWC
Data storage schema To be extended for the pilot SWC
Metadata extraction Parsers for different data sources will
be developed for the pilot
SWC
15 Cf httpsd3jsorg
D54 ndash v 100
Page
36
GraphSearch GUI To be configured for the pilot SWC
Table 12 Components needed to deploy the Second SC6 Pilot
8 Second SC7 Pilot Deployment
81 Use cases
The pilot is carried out by SatCen UoA and NCSR-D in the frame of SC7 Secure societies ndash
Protecting freedom and security of Europe and its citizens
The pilot demonstrates the following workflows
1 Event detection workflow News sites and social media are monitored and processed
in order to extract and localize information about events Events are categorized and
the information from them is extracted the end-user is notified about the area interested
by the news and can visualize the events information together with the changes
detected by the other workflow (if activated)
2 Change detection workflow The end user selects a relevant Area of Interest With
respect to the selected dates two satellite images (earliest and latest) of these areas
are downloaded from ESA Sentinels Scientific Data Hub and processed in order to
detect changes The end-user is notified about detected changes and can view the
images and event information about this area
The second cycle of the SC7 pilot will extend the functionality and improve the performance of
the first cycle of the pilot (cf D52 Section 8)
Apart from the datasets used in the first cycle of the pilot this cycle will also use the keyword-
based Twitter API to retrieve tweets based on pre-defined keywords To further support the
keyword-based search the second cycle of the pilot will also include a full-text indexing engine
The following outputs are made available for visualization or further processing
Relevant news related to specific keywords together with the corresponding Area of
Interested
Detected changes
Moreover the event detection workflow will be extended in order to automatically activate the
change detection workflow These changes are depicted in the updated architecture diagram
in Figure 7
D54 ndash v 100
Page
37
82 Requirements
Table 13 lists the ingestion storage processing and output requirements set by the second
cycle of the pilot Since the present pilot cycle is an extension of the first pilot the requirements
of the first pilot also apply Table 13 lists only the new requirements
Table 13 Requirements of the Second SC7 Pilot
Requirement Comment
R1 Monitor keyword-based text services
(Twitter) Text is retrieved and stored
together with provenance and any
metadata provided by the service
(notably location)
The NOMAD data connectors to Twitter
and Reuters will be adapted to access the
keyword search API of Twitter and store to
Cassandra
R2 Regularly execute event detection
using Spark over the most recent text
batch
Event detection is part of the ingestion
process and adds annotations to the text
data not part of the distributed processing
R3 Improve the speed of the change
detection workflow
Optimize the scalability of the operators
developed in Apache Spark for the change
detection workflow
R4 Extend change detection workflow to
improve accuracy
Fundamental SNAP operators (eg Subset
and Terrain Correction) for Sentinel 1 will be
adapted to Apache Spark
R5 Areas of Interest are automatically
defined by event detection
The Sentinel data connector is
parametrized from the event detection
module with a GIS shape
R6 End-user interface is based on Sextant Improvement of Sextant functionalities to
improve the user experience
D54 ndash v 100
Page
38
R7 Users must be authenticated and
authorized to access the pilot data
Sextant will be extended in order to support
authentication and authorization
Table 13 Requirements of the Second SC7 Pilot
Figure 7 Architecture of the Second SC7 Pilot
Figure 7 Architecture of the Second SC7 Pilot
83 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS for storing satellite images
Cassandra for storing news and tweets content and metadata
Lucene for storing GADM dataset ie the administrative areas together with their geo-
locations
D54 ndash v 100
Page
39
Strabon for storing geo-locations of detected changes and location metadata about
news and tweets
Processing infrastructures
Spark will be made available for improving the change detection module and
developing the event detection module
Data integration
Semagrow will federate Strabon and Cassandra to provide the user interface with
homogeneous access to both data stores
Other modules
Twitter data connector
Reuters RSS feed reader
The Sentinel Data Aggregator receives as input the set of areas of interest and submits
a suitable query to the Sentinels Scientific Data Hub
Sextant as the user interface
84 Deployment
Table 14 lists the components provided to the pilot as part of BDI16 and components that will
be developed within WP6 in the context of executing the pilot
Table 14 Components needed to deploy the Second SC7 Pilot
Module Task Responsible
Big Data Integrator
HDFSHadoop Cassandra
Spark Semagrow Strabon
SOLR
BDI dockers made available by WP4 FH TF InfAI
NCSR-D UoA
SwC
Cassandra and Strabon
stores
The schema needs to be altered to
support tweets by keyword
NCSR-D and
UoA
Change detection module Spark code to be developed for UoA
16 Cf httpsgithubcombig-data-europeREADMEwikiComponents
D54 ndash v 100
Page
40
extending and improving the change
detection algorithm
Event Detection module Spark code to be developed to scale
the event detection algorithm
NCSR-D
Twitter data connector To be extended to access the keyword
search Twitter API
NCSR-D
User interface To be enhanced for the pilot UoA
Table 14 Components needed to deploy the Second SC7 Pilot
D54 ndash v 100
Page
41
9 Conclusions This report analysed the pilot requirements and specifies the components of the the generic
Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting
round The relevant work in this task is to ensure that the components are within the scope
of what is prepared in WP4 and that they interoperate and can be used in the same
application
All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure
and provided to the piloting partners as a basis for their piloting applications which will be
developed in WP6 As a result of this preliminary testing and the interaction between the
technical partners and the piloting partners some of the original pilot descriptions have
been refined and fully specified and their usage of BDI components has been clarified This
ensures that the pilot descriptions are consistent with the first public release of the BDI
platform (D42) and can be reproduced by interested third parties
Work in this task (Task 52) will proceed as follows
During the second pilot deployment phase work in this task will follow and document
development of the individual components and test their integration into the platform
During the third pilot deployment phase work in this task will prepare the next version
of this document regarding the BDI instances needed for the third piloting round
D54 ndash v 100
Page
33
Aggregation and analysis
72 Requirements
Table 11 lists the ingestion storage processing and output requirements set by this pilot
Table 11 Requirements of the Second SC6 Pilot
Requirement Comment
R1 In case of failures during data
processing users should be able to
recover data produced earlier in the
workflow along with their lineage and
associated metadata
Processing modules should periodically
store intermediate results When starting
up processing modules should check at the
metadata registry if intermediate results are
available
R2 Transform budget data into a
homogenized format using various
parsers
Parsers will be developed for the pilot
taking into account R1
R3 Expose data and metadata through a
SPARQL endpoint
The triple store should be accessed via a
SPARQL endpoint
R4 Intuitive easy-to-use interface for
searching and selecting relevant data
sources The use of the user interface
should be documented so that users
can ease into using it with as little
effort as possible
The GraphSearch UI will be used to create
visualizations from SPARQL queries
Table 11 Requirements of the Second SC6 Pilot
D54 ndash v 100
Page
34
Figure 6 Architecture of the Second SC6 Pilot
Figure 6 Architecture of the Second SC6 Pilot
73 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS for storing ingested datasets
4store for storing homogenized statistical data and dataset metadata
Processing infrastructures
Metadata extraction Spark is used to extract RDF data and metadata from budget
data These tools will react on Kafka messages
PoolParty A SKOS Thesaurus13 will be used to consolidatetranslate (linkmap) the
terms in the ingested documents (eg bio terms locations and other named entities)
For this step the SWC PoolParty Semantic Suite14 will be used as an external service
13 Please cf httpsenwikipediaorgwikiSimple_Knowledge_Organization_System 14 Please cf httpwwwpoolpartybiz
D54 ndash v 100
Page
35
PoolParty is accessible from the BDE components via an HTTP API The connection
between Spark and PoolParty has been implemented in the first pilot cycle Additional
enrichment of the dataset will be explored eg via linking to DBpedia or other LOD
sources
Data analysis that will be performed on demand by pre-defined queries in the
dashboard
Other modules
Flume for dataset ingestion For every source that will be ingested into the system there
will be a flume agent responsible for data ingestion and basic modificationunification
Kafka as soon as a new record is available a Kafka message will be produced One
kafka consumer stores raw data into HDFS
A set of pre-defined SPARQL queries that carry out analytical aggregations important
comparisons and or other analysis of the data
GUI that provide functionality for (a) metadata searching to discover datasets data and
publications (b) linked data browsing (ie dereferencing entity descriptions in RDF) in
the form of a visual dashboard realised in d3js15
GraphSearch as the user interface
74 Deployment
Table 12 lists the components provided to the pilot as part of BDI and components that will be
developed within WP6 in the context of executing the pilot
Table 12 Components needed to deploy the Second SC6 Pilot
Module Task Responsible
Spark over HDFS 4store
Flume Kafka
BDI dockers made available by WP4 FH TF InfAI
NCSR-D SWC
Data storage schema To be extended for the pilot SWC
Metadata extraction Parsers for different data sources will
be developed for the pilot
SWC
15 Cf httpsd3jsorg
D54 ndash v 100
Page
36
GraphSearch GUI To be configured for the pilot SWC
Table 12 Components needed to deploy the Second SC6 Pilot
8 Second SC7 Pilot Deployment
81 Use cases
The pilot is carried out by SatCen UoA and NCSR-D in the frame of SC7 Secure societies ndash
Protecting freedom and security of Europe and its citizens
The pilot demonstrates the following workflows
1 Event detection workflow News sites and social media are monitored and processed
in order to extract and localize information about events Events are categorized and
the information from them is extracted the end-user is notified about the area interested
by the news and can visualize the events information together with the changes
detected by the other workflow (if activated)
2 Change detection workflow The end user selects a relevant Area of Interest With
respect to the selected dates two satellite images (earliest and latest) of these areas
are downloaded from ESA Sentinels Scientific Data Hub and processed in order to
detect changes The end-user is notified about detected changes and can view the
images and event information about this area
The second cycle of the SC7 pilot will extend the functionality and improve the performance of
the first cycle of the pilot (cf D52 Section 8)
Apart from the datasets used in the first cycle of the pilot this cycle will also use the keyword-
based Twitter API to retrieve tweets based on pre-defined keywords To further support the
keyword-based search the second cycle of the pilot will also include a full-text indexing engine
The following outputs are made available for visualization or further processing
Relevant news related to specific keywords together with the corresponding Area of
Interested
Detected changes
Moreover the event detection workflow will be extended in order to automatically activate the
change detection workflow These changes are depicted in the updated architecture diagram
in Figure 7
D54 ndash v 100
Page
37
82 Requirements
Table 13 lists the ingestion storage processing and output requirements set by the second
cycle of the pilot Since the present pilot cycle is an extension of the first pilot the requirements
of the first pilot also apply Table 13 lists only the new requirements
Table 13 Requirements of the Second SC7 Pilot
Requirement Comment
R1 Monitor keyword-based text services
(Twitter) Text is retrieved and stored
together with provenance and any
metadata provided by the service
(notably location)
The NOMAD data connectors to Twitter
and Reuters will be adapted to access the
keyword search API of Twitter and store to
Cassandra
R2 Regularly execute event detection
using Spark over the most recent text
batch
Event detection is part of the ingestion
process and adds annotations to the text
data not part of the distributed processing
R3 Improve the speed of the change
detection workflow
Optimize the scalability of the operators
developed in Apache Spark for the change
detection workflow
R4 Extend change detection workflow to
improve accuracy
Fundamental SNAP operators (eg Subset
and Terrain Correction) for Sentinel 1 will be
adapted to Apache Spark
R5 Areas of Interest are automatically
defined by event detection
The Sentinel data connector is
parametrized from the event detection
module with a GIS shape
R6 End-user interface is based on Sextant Improvement of Sextant functionalities to
improve the user experience
D54 ndash v 100
Page
38
R7 Users must be authenticated and
authorized to access the pilot data
Sextant will be extended in order to support
authentication and authorization
Table 13 Requirements of the Second SC7 Pilot
Figure 7 Architecture of the Second SC7 Pilot
Figure 7 Architecture of the Second SC7 Pilot
83 Architecture
To satisfy the requirements above the following modules will be deployed
Storage infrastructures
HDFS for storing satellite images
Cassandra for storing news and tweets content and metadata
Lucene for storing GADM dataset ie the administrative areas together with their geo-
locations
D54 ndash v 100
Page
39
Strabon for storing geo-locations of detected changes and location metadata about
news and tweets
Processing infrastructures
Spark will be made available for improving the change detection module and
developing the event detection module
Data integration
Semagrow will federate Strabon and Cassandra to provide the user interface with
homogeneous access to both data stores
Other modules
Twitter data connector
Reuters RSS feed reader
The Sentinel Data Aggregator receives as input the set of areas of interest and submits
a suitable query to the Sentinels Scientific Data Hub
Sextant as the user interface
84 Deployment
Table 14 lists the components provided to the pilot as part of BDI16 and components that will
be developed within WP6 in the context of executing the pilot
Table 14 Components needed to deploy the Second SC7 Pilot
Module Task Responsible
Big Data Integrator
HDFSHadoop Cassandra
Spark Semagrow Strabon
SOLR
BDI dockers made available by WP4 FH TF InfAI
NCSR-D UoA
SwC
Cassandra and Strabon
stores
The schema needs to be altered to
support tweets by keyword
NCSR-D and
UoA
Change detection module Spark code to be developed for UoA
16 Cf httpsgithubcombig-data-europeREADMEwikiComponents
D54 ndash v 100
Page
40
extending and improving the change
detection algorithm
Event Detection module Spark code to be developed to scale
the event detection algorithm
NCSR-D
Twitter data connector To be extended to access the keyword
search Twitter API
NCSR-D
User interface To be enhanced for the pilot UoA
Table 14 Components needed to deploy the Second SC7 Pilot
D54 ndash v 100
Page
41
9 Conclusions This report analysed the pilot requirements and specifies the components of the the generic
Big Data Integrator Platform (BDI) that are required for each pilot of the second piloting
round The relevant work in this task is to ensure that the components are within the scope
of what is prepared in WP4 and that they interoperate and can be used in the same
application
All seven BDI instantiations have been deployed and tested at the NCSR-D infrastructure
and provided to the piloting partners as a basis for their piloting applications which will be
developed in WP6 As a result of this preliminary testing and the interaction between the
technical partners and the piloting partners some of the original pilot descriptions have
been refined and fully specified and their usage of BDI components has been clarified This
ensures that the pilot descriptions are consistent with the first public release of the BDI
platform (D42) and can be reproduced by interested third parties
Work in this task (Task 52) will proceed as follows
During the second pilot deployment phase work in this task will follow and document
development of the individual components and test their integration into the platform
During the third pilot deployment phase work in this task will prepare the next version
of this document regarding the BDI instances needed for the third piloting round