
Cloud LSVA

Large Scale Video Analysis

EUROPEAN COMMISSION

DG Communications Networks, Content & Technology

Horizon 2020 Research and Innovation Programme

Grant Agreement Nr 688099

Import/export interfaces and

Annotation data model and storage

specification

Project funded by the European Union’s Horizon 2020 Research and Innovation Programme (2014 – 2020)

Deliverable no. D 3.1

Dissemination level Public

Work Package no. WP 3 Semi-automatic video annotation and search

Main author(s) G. Dubbelman

Co-author(s)

Version Nr (F: final, D: draft) F

File Name Cloud LSVA Deliverable 3.1

Project Start Date and Duration 01 January 2016, 36 months

Ref. Ares(2016)4892397 - 31/08/2016


Document Control Sheet

Main author(s) or editor(s): G. Dubbelman
Work area: WP 3 Semi-automatic video annotation and search
Document title: Import/export interfaces and annotation data model and storage specification

Version history:

Version number  Date        Main author   Summary of changes
v0.1            23-06-2016  G. Dubbelman  n.a.
v0.2            18-07-2016  G. Dubbelman  Consolidated partner inputs
v0.3            02-08-2016  G. Dubbelman  Consolidated partner inputs
v0.4            04-08-2016  G. Dubbelman  Reviewer comments processed
v1.0            11-08-2016  G. Dubbelman  Final version

Approval:

Name Date

Prepared G. Dubbelman 02-08-2016

Reviewed B. Rousseau 04-08-2016

Authorised O. Otaegui 22-08-2016

Circulation:

Recipient Date of submission

EC 31-08-2016

Cloud LSVA consortium 22-08-2016

Legal Disclaimer

The information in this document is provided “as is”, and no guarantee or warranty is given that the information is fit for any particular purpose. The above referenced consortium members shall have no liability for damages of any kind including without limitation direct, special, indirect, or consequential damages that may result from the use of these materials subject to any liability which is mandatory due to applicable law. © 2016 by Cloud LSVA Consortium.



Abbreviations and Acronyms

Acronym Definition

EC European Commission

PO Project officer

GA Grant Agreement

WP Work Package

SAN Storage Area Network

NAS Network Attached Storage

OS Operating System


Table of Contents

Executive Summary
1. Introduction
   1.1 Content summary
   1.2 Purpose of Document
   1.3 Intended audience
2. Storage specification
   2.1 Storage server specification
   2.2 Supported software and services
3. Data formats
   3.1 Archive data types
   3.2 Image and video formats
   3.3 Annotation formats
   3.4 Positioning formats
   3.5 Map formats
   3.6 Scenario Formats
   3.7 Machine learning model formats
   3.8 Meta data formats
4. Import and Export interfaces
   4.1 Upload Engine
   4.2 Mobile Data Services
   4.3 Bulk Data Services
5. Conclusion
Annexes
   Annex A: JSON Examples


List of Figures

Figure 1: Cloud-LSVA architecture.
Figure 2: Top-level data storage architecture.
Figure 3: RoadDNA illustration.
Figure 4: LBDO framework overview.

List of Tables

Table 1: Local cloud storage system specification.
Table 2: SoftLayer Cloud storage system specification.


Executive Summary

The aim of this project is to develop a software platform for efficient and collaborative semi-automatic labelling and exploitation of large-scale video data, addressing existing needs of the ADAS and digital cartography industries. Cloud-LSVA will use Big Data technologies to address the open problem of a lack of software tools and hardware platforms for annotating petabyte-scale video datasets, with a focus on the automotive industry. Annotations of road traffic objects, events, and scenes are critical for training and testing the computer vision techniques that are at the heart of modern Advanced Driver Assistance Systems and navigation systems. Providing this capability will establish a sustainable basis to drive forward automotive Big Data technologies.


1. Introduction

This document describes the interface of the cloud system’s data storage facilities in terms of import and export functionality, data formats, and services. It is constructed to be a working document that evolves over time and serves different purposes throughout the project, as listed in Section 1.2. The focus of this document with respect to the overall architecture is depicted in Figure 1.

Figure 1: Cloud-LSVA architecture. This document describes the parts enclosed in red in terms of data import/export interfaces as well as the formats of stored data.

1.1 Content summary

Section 2 describes the cloud system’s data storage facilities in terms of hardware and software. Referring to the overall architecture in Figure 1, these data storage facilities are depicted under the labels Object Stores and Data Stores.

Section 3 describes the formats of data stored in the Data Stores and Object Stores.

Section 4 details the import and export interfaces of the Data Stores and Object Stores and, most importantly, of the Upload Engine, the Mobile Data Services, and the Bulk Data Services (see Figure 1).


1.2 Purpose of Document

Version Month 5: This version is intended as a comprehensive collection of all partners’ expectations regarding the data storage facilities. This collection only contains currently available services, data formats, and import/export interfaces. On the basis of this version, it is decided which existing interfaces are made available and which novel interfaces are to be designed and developed during the project.

Version Month 14: This intermediate version lists and describes all interfaces that are made available on the cloud system’s data storage facilities. This contains existing services, data formats, and import/export interfaces, as well as the specification of novel interfaces that are designed and developed during this project. This version serves as a guideline for the R&D efforts of project partners.

Version Month 26: The final version lists and describes all interfaces of the cloud system’s data storage facilities that are available at the end of the project. The purpose of this document is that of a reference manual for end-users of the developed systems.

1.3 Intended audience

Referring to the three versions of this document listed in Section 1.2, the intended audience for the Month 5 and Month 14 versions is restricted to the partners of the Cloud-LSVA project. The final Month 26 version is intended to be a publicly available document that serves as a reference manual for all who use the developed cloud system.


2. Storage specification

This section describes the cloud system’s data storage facilities in terms of hardware and software. Referring to the overall architecture in Figure 1, these data storage facilities are depicted under the labels Object Stores and Data Stores.

2.1 Storage server specification

The top-level storage architecture, provided in Figure 2, consists of Network Attached Storage (NAS) together with an object store, which together provide the default cloud storage services. It is accompanied by an Upload Engine that provides extended (Cloud-LSVA specific) low-level access to stored data. This Upload Engine is itself part of the core Cloud-LSVA services and is therefore described in Section 4. Users of the system can interact with stored data (depending on security levels) at all levels, i.e. extended, default, or directly with the NAS. The direct and default levels are detailed in Section 2.2 and the extended level (part of the Cloud-LSVA services) in Section 4.

Figure 2: Top-level data storage architecture.

During the start of the project, the core of the data storage facilities will consist of three local cloud storage systems: one located at Vicomtech (San Sebastian, Spain), one at TU Eindhoven (Eindhoven, The Netherlands), and one at DCU (Dublin, Ireland). Their specification is listed in Table 1. The specific local cloud storage systems only serve as a reference during the start of the project; the developed Cloud-LSVA system generalizes to equivalent cloud storage systems. Over the duration of the project we will shift from using the local systems to using IBM SoftLayer cloud storage facilities.

Table 1: Local cloud storage system specification.

Model: Synology NAS RS3614XS 2U
Capacity: 96 TB (12 x 8 TB) using EXT4 or Btrfs
Processor: Intel Core i3-4130 dual-core (3.40 GHz)
Memory: 4 GB DDR3 ECC
RAID: 0, 1, 5, 6, 10, JBOD
Network: 1 Gbit (up to 10 Gbit)
Encryption: dedicated encryption engine (AES at 2,485 MB/s read)
OS: Synology DiskStation Manager (Linux-based)

As of phase one, the cloud storage is intended to be hosted in IBM SoftLayer, with the hardware specification as listed in Table 2 below. The specification has been scaled to fit that of the local implementations of the aforementioned consortium members (Table 1). Implementing the cloud storage to mirror the local implementations ensures that fewer resources sit idle, thus maximising the effectiveness of the budget. When required, the cloud infrastructure can be expanded in order to increase storage capacity and/or computing power. The storage device will be custom built on a bare-metal server using the QuantaStor operating system (OS). This OS will deliver unified storage as a combined Storage Area Network (SAN) and Network Attached Storage (NAS) setup, bundled with a suite of encryption capabilities. The main difference between the two technologies is that a SAN traditionally deals with block storage, and a NAS with file storage. As a result, the protocols that are suitable for sending the data to the processing nodes also differ, such as iSCSI for SAN and NFS for NAS.

Table 2: SoftLayer Cloud storage system specification.

Model: Custom
Capacity: 96 TB (up to 288 TB, 36 x 8 TB SATA)
Processor: Dual Intel Xeon E5-2620 6-core (2.0-2.5 GHz)
Memory: 16 GB (up to 512 GB)
RAID: 0, 1, 5, 6, 10, JBOD
Network: 1 Gbps (up to 10 Gbps)
Encryption: AES-256, SMB3, IPSec, and HTTPS
OS: QuantaStor 3.x

2.2 Supported software and services

Note M6 version: In this version of the document all software and services that are currently available or will become available are listed. Not all software and services will be used in the final Cloud-LSVA system. See Section 1.2 for more information on the different versions of this document.

The Data Stores and Object Stores will deliver default cloud storage services to the Cloud-LSVA system. For the reference system that is used during the start of the project, the most important services are detailed below. For all available software and services, see www.synology.com.

Cloud Station Server and Clients
Cloud Station Server allows users to sync their data from multiple platforms, centralizing it on the Synology NAS while keeping historic versions of all important files. Client utilities can be installed on Windows, Mac, and Linux, as well as Android and iOS devices, to keep files in sync across all platforms. Cloud Station ShareSync allows users to connect to any Cloud Station Server. They use it to sync files in real time, meaning that there is always a backed-up version of a file safely secured on the remote Synology NAS device. At the same time, the host Synology NAS can collect and distribute data to all connected clients, syncing data across multiple clients and allowing for effective cross-site collaboration. With Cloud Sync, users can seamlessly sync and share files among their DiskStation and multiple public clouds, such as Dropbox, Baidu Cloud, and Google Drive.


Directory Server
Directory Server provides an LDAP service with centralized access control, authentication, and account management. LDAP users and groups can be managed with this package.

WebDAV
WebDAV Server allows users to edit and manage files stored on remote servers. When WebDAV Server is enabled, client programs that support WebDAV, such as certain Windows apps, the Mac OS Finder, or the Linux file browser, will be able to remotely access the Synology NAS just like a local network drive.

Media Servers
Media Server provides a multimedia service to browse and play the multimedia content on the Synology NAS via DLNA/UPnP home devices. With Media Server, devices such as TV sets and stereo systems can easily be connected to the home network, and multimedia files stored on the Synology NAS can be streamed to them to enjoy music, photos, and videos.

VPN Server

VPN Server offers an easy VPN solution that turns your Synology product into a VPN server, providing a secure method to connect to a private LAN at a remote location. All PPTP, OpenVPN, and L2TP/IPSec services are supported.


3. Data formats

Note M6 version: In this version of the document all data formats that are currently available or will become available are listed. Not all data formats will be used in the final Cloud-LSVA system. See Section 1.2 for more information on the different versions of this document.

This section describes the formats of data stored in the Data Stores.

3.1 Archive data types

Archive data types allow the storage of multiple and different data elements in a single file. Advanced archives, such as those listed below, furthermore allow for synchronized recording and playback of the data elements in the archive.

ROS bags
Bags are the primary mechanism in ROS for data logging, which means that they have a variety of offline uses. Researchers have used the bag file toolchain to record datasets, then visualize and label them, and store them for future use. Bag files have also been used to perform long-term hardware diagnostics logging for the PR2 robot. Tools like rqt_bag allow you to visualize the data in a bag file, including plotting fields and displaying images. You can also quickly inspect bag file data from the console using the rostopic command. rostopic supports listing bag file topics as well as echoing data to screen. There are also programmatic APIs in the rosrecord package that give C++ and Python (and other programming languages) packages the ability to iterate over stored messages; a minimal Python sketch is given at the end of this subsection. For quicker manipulations of bag files, the rosbag tool supports rebagging a bag file, which allows you to extract messages that match a particular filter into a new bag file. The data stored within bag files is often very valuable, so bag files are also designed to be easily migrated when msg files are updated. The bag file format stores the msg file of the corresponding message data, and tools like rosbagmigration let you write rules to automatically update bag files when they become out of date.

Protocol buffers
Protocol buffers (https://developers.google.com/protocol-buffers/) are a flexible, automated mechanism for serializing structured data. The structured data is defined once, and a set of generators can be used to produce source code to easily write and read the data to and from a variety of data streams, using a variety of languages. Versioning features are also available, making it possible to update the data structure without breaking deployed programs that are compiled against the "old" format.

RTMaps .rec files
The RTMaps recording format is designed to store multiple asynchronous data streams. Each stream sample is stored with its timestamp information for accurate data playback (including stream synchronisation) in the so-called .rec file. Each stream can be recorded in its own file to allow easier access to a subset of streams within a large recording (no need to read/decode the full recording to read only one data stream). This is particularly useful in the cloud context, where bandwidth and processing power must be optimised. Storing each stream in its own file also enables easy data access from standard tools thanks to the use of standard or open storage formats. Random (and fast) access is also possible thanks to the use of multiple index files: one for the recording (the main .idx file) and one per stream.
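
As referenced above, the following is a minimal sketch of the rosbag Python API, assuming a ROS1 installation; the topic name and message contents are illustrative only.

```python
import rosbag
from std_msgs.msg import String

# Write a small bag with a few messages on one topic.
with rosbag.Bag('demo.bag', 'w') as bag:
    for i in range(3):
        bag.write('/camera/meta', String(data='frame %d' % i))

# Read it back, iterating over (topic, message, timestamp) tuples.
with rosbag.Bag('demo.bag') as bag:
    for topic, msg, t in bag.read_messages(topics=['/camera/meta']):
        print(topic, t.to_sec(), msg.data)
```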

3.2 Image and video formats

RAW (camera specific)
A camera raw image file contains minimally processed data from the image sensor of a digital camera, image scanner, or motion picture film scanner. Raw files are named so because they are not yet processed and therefore are not ready to be printed or edited with a bitmap graphics editor. Normally, the image is processed by a raw converter in a wide-gamut internal colorspace where precise adjustments can be made before conversion to a "positive" file format such as TIFF or JPEG for storage, printing, or further manipulation. This often encodes the image in a device-dependent colorspace. There are dozens, if not hundreds, of raw formats in use by different models of digital equipment (such as cameras or film scanners).

PPM (no compression)
The PPM format is a lowest common denominator color image file format. It should be noted that this format is egregiously inefficient: it is highly redundant, while containing a lot of information that the human eye cannot even discern. Furthermore, the format allows very little information about the image besides basic color, which means you may have to couple a file in this format with other independent information to get any decent use out of it. However, it is very easy to write and analyze programs to process this format, and that is the point (a minimal writer is sketched at the end of this section).

PNG (lossless compression)
Portable Network Graphics is a raster graphics file format that supports lossless data compression. PNG was created as an improved, non-patented replacement for the Graphics Interchange Format (GIF), and is the most used lossless image compression format on the Internet. PNG supports palette-based images (with palettes of 24-bit RGB or 32-bit RGBA colors), grayscale images (with or without alpha channel), and full-color non-palette-based RGB[A] images (with or without alpha channel). PNG was designed for transferring images on the Internet, not for professional-quality print graphics, and therefore does not support non-RGB color spaces such as CMYK.

JPEG (lossy compression)
JPEG is a commonly used method of lossy compression for digital images, particularly for those images produced by digital photography. The degree of compression can be adjusted, allowing a selectable tradeoff between storage size and image quality. JPEG typically achieves 10:1 compression with little perceptible loss in image quality. JPEG compression is used in a number of image file formats. JPEG/Exif is the most common image format used by digital cameras and other photographic image capture devices; along with JPEG/JFIF, it is the most common format for storing and transmitting photographic images on the World Wide Web. These format variations are often not distinguished, and are simply called JPEG.

JSON
JavaScript Object Notation (JSON) can be used as a format for storing images. An example JSON image file header is provided in Annex A. JSON is an open-standard format that uses human-readable text to transmit data objects consisting of attribute-value pairs. It is the most common data format used for asynchronous browser/server communication, largely replacing XML.

JPEG2000 (lossy compression)
The JPEG2000 file format uses the discrete wavelet transform (DWT) for image compression. The DWT successively applies a bank of high- and low-pass filters to the image and encodes the differences between the original and the filtered images. In contrast to the discrete cosine transform (DCT), natively used in the JPEG file format and applied to image blocks only, the DWT is applied to the entire image. Because of this, and considering that the filters are applied successively (from high- to low-pass), the DWT inherently allows for progressive decoding of the image. Typically, better compression results are obtained using the DWT, when compared to the DCT, for the same quality of the decoded image. Nevertheless, similar to the DCT, due to the quantization and entropy encoding of the difference images obtained, the DWT also belongs to the lossy compression methods.

MPEG
A common family of standards for the coding and compression of moving images, defined by the Moving Picture Experts Group in different versions.

The MPEG-1 standard on "Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s", ISO/IEC 11172, is optimized for a resolution of 352x288 (PAL) or 352x240 (NTSC) pixels at a framerate of 30 images per second. With an approximate bitrate of 1.5 Mbit/s, MPEG-1 offers VHS image quality.

The MPEG-2 standard on "Generic coding of moving pictures and associated audio", ISO/IEC 13818 and ITU-T Rec. H.222.0 and Rec. H.262, has eleven parts, of which the first three (systems, video, and audio) were approved in 1994. It aims at a wider range of applications, such as streaming video, HDTV, and digital sound. It was adopted by digital TV and radio broadcasting over satellite and cable, with considerable success, and it also specifies the video format of the widespread DVD media.

The MPEG-4 standard on "Coding of audio-visual objects", ISO/IEC 14496, was designed for systems with lower bitrates than those of MPEG-2. It is the first standard which handles multimedia content as a set of audiovisual objects that can be presented, manipulated, and transmitted independently. Part 2 corresponds to the MPEG-2 video encoding standard extended to visual objects; it is also generally referred to as the MPEG-4 video standard. Part 10 changed the philosophy of the compression: it presents joint work with ITU-T VCEG and describes a new state-of-the-art approach in video coding. This standard is referred to as MPEG-4 AVC (Advanced Video Coding) or H.264/AVC. The new developments of the H.264/AVC standard, when compared to MPEG-4 Part 2 Video, are mostly refinements of the existing algorithms which allow for better compression rates. The new features include, among others: variable block-size motion compensation with blocks from 16x16 down to 4x4 samples, quarter-sample spatial resolution for the motion vectors, context-adaptive entropy coding, multiple reference picture motion compensation, improved motion referencing, weighted prediction, a 4x4 block-size transform, hierarchical block transform, in-the-loop deblocking filtering as a post-processing approach, data partitioning, etc. Moreover, a network abstraction layer (NAL) was defined to enable an independent encapsulation of the video coding layer (VCL) and allow for easy conversion of the video stream to existing network transmission protocols (e.g. IP or MPEG-2 transport protocols). MPEG-4 standards are subject to royalty fees. Due to the several parties involved, they can be licensed via the MPEG Licensing Authority, a patent pool management company not affiliated with MPEG, under the MPEG-4 Visual Patent Portfolio License.

MJPEG and Motion JPEG 2000
The Motion JPEG (MJPEG) format encodes a video as a sequence of separately compressed JPEG images (DCT compression algorithm). This format is often used by low-cost video acquisition devices like webcams, IP cameras, and older digital camera models (e.g. Nikon D90, Pentax K-7). Since MJPEG applies only intraframe compression, it achieves only a limited compression ratio, 1:20 or lower, whereas current state-of-the-art video codecs achieve real-world ratios of 1:50 or better. Nevertheless, due to its simplicity, the hardware requirements on processing power and memory are lower. Moreover, it provides higher quality of motion acquisition due to the omitted interframe compression. Considering the better results of DWT-based compression of static images, as applied in the JPEG 2000 file format, the Motion JPEG 2000 standard was defined in ISO/IEC 15444-3 and ITU-T T.802. Similar to MJPEG, it applies the lossy or lossless variants of the JPEG 2000 compression method and does not involve temporal (interframe) compression approaches.

Video container files
A video container is a metafile format specifying the coexistence of different data elements and metadata in one computer file. While the video encoding formats mentioned previously define the digital format of the video stream, i.e. a stream of 0’s and 1’s, the encoded video stream can be saved under various file formats which, in addition to the video data itself, include information about the encoder used, the parameters necessary for decoding the stream, specifications of the video, etc. Some container files even include multiple audio and video streams, subtitles, chapter information, synchronization tags, and more. The most popular audiovisual data containers are:

3GP – container format used by many mobile phones, based on the ISO base media file format;

Advanced Streaming Format (ASF) - container for Microsoft’s Windows Media Video files;

Audio-Video Interleaved (AVI) - the standard Microsoft Windows media container, based on the general Resource Interchange File Format;

Macromedia Flash Video (FLV, F4V) - container for video and audio streams by Adobe Systems;

Matroska (MKV) – an open standard and open source container format able to hold “virtually anything”;

MJ2 - Motion JPEG 2000 file format, based on the ISO base media file format which is defined in MPEG-4 Part 12 and JPEG 2000 Part 12;

QuickTime File Format (MOV) - the standard QuickTime video container developed by Apple Inc.; it was used as a basis for the definition of the MPEG-4 file formats;

MPEG - standard container for MPEG-1 and MPEG-2 elementary streams, used also on DVD-Video discs;

MPEG-2 TS (transport stream) - standard container for digital broadcasting and for transport over unreliable media; also used on Blu-ray Disc video;

MP4 - standard audio and video container for the MPEG-4 multimedia portfolio, based on the ISO base media file format defined in MPEG-4 Part 12 and JPEG 2000 Part 12;

Ogg - standard container for Vorbis audio format and VP3 (Theora) video format;

RealMedia (RM) - standard container for proprietary RealVideo and RealAudio multimedia formats

WebM – a multimedia format for HTML5 developed by Google, derived from the Matroska container format.
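
As noted above, PPM’s appeal is that it is trivial to produce. The following minimal sketch writes a valid binary PPM (P6) image using only the Python standard library; the gradient image content is illustrative.

```python
# PPM (P6) needs only a text header (magic, width, height, maxval)
# followed by raw RGB bytes, which is why it is so easy to generate.
width, height = 64, 48
pixels = bytearray()
for y in range(height):
    for x in range(width):
        pixels += bytes((x * 4 % 256, y * 5 % 256, 128))  # simple gradient

with open('example.ppm', 'wb') as f:
    f.write(b'P6\n%d %d\n255\n' % (width, height))
    f.write(pixels)
```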


3.3 Annotation formats

Although annotation is a well-known topic in the scientific community, there is no consensus on annotation formats, and no common framework has been adopted by the majority of scientists, who still propose task-specific formats often attached to specific domains or purposes and without the required flexibility to be adopted widely. It is worth mentioning that the data model and the storage file format are orthogonal concepts, i.e. a given data model might be stored as an XML file, or alternatively as JSON or MPEG-7 files, if the data model is arranged in a compatible shape. The following paragraphs summarize some of the existing initiatives that provide such general annotation formats.

Viulib VCD
Viulib VideoContentDescription (VCD) is an annotation format specially devised to describe the content of image sequences, in the form of spatial, temporal, or spatio-temporal entities. The annotation of spatial entities is abstract, to include any possible Object data type, such as points, lines, polygons, binary masks, or generic arrays of numbers. Semantic actions occurring in the videos can be easily described as Contexts or Events, reaching any desired level of annotation complexity. VCD is designed to support connection with ontologies, through the definition of Relations. All the elements can be defined and manipulated in both offline (batch processing) and online (updated sequentially) modes. The VCD C++ API provides a number of tools to perform operations on the annotations, such as create, update/modify, delete, find, etc. Annotations can be stored as XML files, but also messaged as JSON strings. It is integrated in the latest version of Vicomtech’s Viulib libraries (www.viulib.org). Additional implementations include C++ classes that compare two different VCD files and create evaluation reports (e.g. comparison with ground truth).

ViPER XGTF
The Video Performance Evaluation Resource (ViPER) is a toolkit of scripts and Java programs that enable the markup of visual data ground truth, and systems for evaluating how closely sets of result data approximate that truth (http://viper-toolkit.sourceforge.net). It comes with a number of tools, such as the ViPER Performance Evaluation Tool (ViPER-PE), the ViPER Ground Truth Authoring Tool (ViPER-GT), the Java MPEG-1 Decoder, etc. The annotation format is called XGTF, a temporally qualified relational model, consisting of Descriptors and their Attributes, which correspond to rows and columns in an SQL database. The TRECVid initiative has traditionally proposed the ViPER tools and format for the evaluation and comparison of algorithms.

VATIC
The Video Annotation Tool from Irvine, California (VATIC) is a free, online, interactive video annotation tool for computer vision research that crowdsources work to Amazon’s Mechanical Turk. This tool makes it easy to build massive, affordable video datasets and can be deployed on a cloud. VATIC uses a simple delimiter-separated value (DSV) numerical format, which can be interpreted only with specification files that fix the concepts or attributes of objects. Originally, VATIC could only annotate spatial objects as bounding boxes, although a number of branch developments have emerged recently to include human action annotation capabilities.

Caffe-compatible formats for annotated image databases
Caffe (http://caffe.berkeleyvision.org/) is an open source deep learning framework. It is the backend of NVIDIA DIGITS (https://developer.nvidia.com/digits), which is an open source interactive deep learning GPU training system. Besides, Caffe is also one of the dependencies of Vicomtech’s Viulib libraries (www.viulib.org) for handling deep learning technology. Thus, deep neural network models based on Caffe-compatible data formats can be trained with NVIDIA DIGITS and deployed on different kinds of devices via Viulib. Caffe-compatible formats for annotated image databases cover different kinds of computer vision problems: the holistic classification of images, the detection of object bounding boxes in images, the detection of object shape landmarks, and the semantic segmentation of image regions, among others. Depending on the kind of problem, the kind of annotation differs from the point of view of users, but at the end of the process, Caffe needs the data to be converted to one of the following data formats, which are especially designed to handle big data:

LMDB (https://symas.com/products/lightning-memory-mapped-database/): The Lightning Memory-Mapped Database (LMDB) format is based on a software library of the same name that provides a high-performance embedded transactional database in the form of a key-value store. LMDB stores arbitrary key/data pairs as byte arrays, has a range-based search capability, supports multiple data items for a single key, and has a special mode for appending records at the end of the database (MDB_APPEND) which gives a dramatic write performance increase over other similar stores. LMDB is not a relational database; it is strictly a key-value store like Berkeley DB and dbm (a minimal usage sketch is given below).

LevelDB (https://github.com/google/leveldb): LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values. LevelDB stores keys and values in arbitrary byte arrays, and data is sorted by key. It supports batched writes, forward and backward iteration, and compression of the data via Google's Snappy compression library. LevelDB is not an SQL database. Like other NoSQL and dbm stores, it does not have a relational data model and it does not support SQL queries. Also, it has no support for indexes. Applications use LevelDB as a library, as it does not provide a server or command-line interface.

HDF5 (https://www.hdfgroup.org/HDF5/): HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high-volume and complex data. HDF5 is portable and extensible, allowing applications to evolve in their use of HDF5.

NVIDIA DIGITS can handle in a user-friendly way the conversion of annotated data to the LMDB format (also to HDF5 in the newer versions). For instance, for the image classification problem, it requires the user to provide the path to a folder that contains a database of images, where all the images of the same class are stored in a folder named with the corresponding class name.
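
As a minimal sketch of the LMDB key-value pattern described above, assuming the third-party 'lmdb' Python package is installed; in real Caffe databases the values are serialized Datum protobufs, replaced here by placeholder bytes.

```python
import lmdb

# Open (or create) a database; map_size reserves address space, not disk.
env = lmdb.open('train_db', map_size=1 << 30)  # 1 GiB
with env.begin(write=True) as txn:
    for i, blob in enumerate([b'img0-bytes', b'img1-bytes']):
        txn.put(b'%08d' % i, blob)  # zero-padded keys keep ordering stable

with env.begin() as txn:
    print(txn.get(b'00000000'))
```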

JSON
JavaScript Object Notation (JSON) can be used to encode and describe image content, i.e. objects, events, and context. The exact format used is decided by the developer. A possible JSON-based format for objects, events, and contexts is described below.

An object is a logical component that can represent different types of physical objects such as "car", "pedestrian", "traffic sign", "zebra crossing", "curb", or "lane markings". A distinction can be made between static and dynamic objects. Depending on the type of object, different core attributes describe its location:

Static:
- 2D object: represented by a 2D bounding box
- 3D object: represented by a 3D bounding box

Dynamic:
- 2D object: represented by a 2D bounding box plus information encoding the timestamps of the beginning and the end of the displacement
- 3D object: represented by a 3D bounding box plus information encoding the timestamps of the beginning and the end of the displacement

We propose to have one model for representing the different categories of objects. The key "category" can have one of the following values:

- "Static_2D_object": a static 2D object present in one frame/image
- "Dynamic_2D_object": a dynamic 2D object present in a series of frames delimited by the value of the key "keyFrameId", which contains the positions of the first and last frames
- "Static_3D_object": the 3D position of an object in one frame
- "Dynamic_3D_object": the 3D position of an object in a sequence of frames

The location of the object in the video is defined by:

- "keyFrameID": an array of two values containing the references of the first and last frames in which the object is present
- "bbox": an array of 4 numbers encoding the bounding box of the object in the first frame. There are two possibilities to encode the bounding box: two points (top left and bottom right), or one point (top left) plus length and orientation
- "geometry": a polygon containing the coordinates of the object in each frame

An event is used to highlight and describe the behaviour of objects present in the scene, e.g. "obstacle in the road", or the "status" of a pedestrian, e.g. "direction toward the road". The context is metadata used to describe the environment and the conditions of the road. An example JSON description of an object, event, and context is provided in Annex A (see also the illustrative sketch at the end of this section).

Open Annotation
The Open Annotation Data Model (http://www.openannotation.org) is an XML/RDF-based format to enable the annotation and commenting of any web resource. The format can then be shared across a wide range of open tools that understand it. The format provides textual as well as graphical selectors (primitive 2D geometry constructs, including polygons) to identify specific targets that can be linked to annotations.

Tensorflow TFRecords
The format used by Tensorflow (https://www.tensorflow.org) for efficient data reading while training a neural network on a large dataset. A TFRecords file contains a sequence of strings with CRC hashes.

3.4 Positioning formats

NMEA
NMEA is a combined electrical and data specification for communication between marine electronics such as echo sounders, sonars, anemometers, gyrocompasses, autopilots, GPS receivers, and many other types of instruments. It has been defined by, and is controlled by, the National Marine Electronics Association. It replaces the earlier NMEA 0180 and NMEA 0182 standards. In marine applications, it is slowly being phased out in favor of the newer NMEA 2000 standard. (A minimal parsing sketch is given at the end of this section.)

ROS messages
ROS nodes communicate with each other by publishing messages to topics. A message is a simple data structure, comprising typed fields. Standard primitive types (integer, floating point, boolean, etc.) are supported, as are arrays of primitive types. Messages can include arbitrarily nested structures and arrays (much like C structs). Nodes can also exchange a request and response message as part of a ROS service call. These request and response messages are defined in srv files.

OpenLR
OpenLR is an open, compact, and royalty-free dynamic location referencing method, which enables reliable object location exchange and cross-referencing in digital maps of different vendors and versions.
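
As a minimal sketch of the NMEA sentence structure described above, the following parses latitude and longitude from a GGA sentence; the sentence is a commonly cited example, and only the standard library is used.

```python
sentence = "$GPGGA,123519,4807.038,N,01131.000,E,1,08,0.9,545.4,M,46.9,M,,*47"

def dm_to_deg(dm: str, hemi: str) -> float:
    """NMEA encodes angles as (d)ddmm.mmmm; convert to decimal degrees."""
    dot = dm.index('.')
    degrees = float(dm[:dot - 2])
    minutes = float(dm[dot - 2:])
    value = degrees + minutes / 60.0
    return -value if hemi in ('S', 'W') else value

fields = sentence.split(',')
lat = dm_to_deg(fields[2], fields[3])
lon = dm_to_deg(fields[4], fields[5])
print(lat, lon)  # approx. 48.1173, 11.5167
```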


3.5 Map formats

G2O
g2o is an open-source C++ framework for optimizing graph-based nonlinear error functions. g2o has been designed to be easily extensible to a wide range of problems, and a new problem can typically be specified in a few lines of code. The current implementation provides solutions to several variants of SLAM and BA.

NDS
The Navigation Data Standard (NDS) is a runtime format for navigation maps that is scalable to enable incremental updates to the NDS map (regional, tile, or attribute patch). It is the current automotive industry standard. For the lane navigation research, the NDS map is extended with multiple layers containing detailed lane information linked to road segments in the map, as required for lane-navigation and highly-automated-driving systems.

RoadDNA
RoadDNA, as illustrated in Figure 3, delivers a highly optimized, 3D lateral and longitudinal view of the roadway. With this, a vehicle can correlate RoadDNA data with data obtained by its own sensors. By doing this correlation in real time, the vehicle knows exactly where it is located on the road, even while traveling at high speeds. By converting a 3D point cloud of roadside patterns into a compressed, 2D view of the roadway, RoadDNA delivers a solution that can be used in-vehicle with limited processing requirements. Without losing roadway detail, TomTom RoadDNA follows a feature-agnostic approach which is robust and scalable. This technique eliminates the complexity of identifying each single roadway object, and instead creates a unique pattern of the roadway environment.

Figure 3: RoadDNA illustration.


OpenStreetMap XML
OpenStreetMap is a collaborative project, based on community input, to create a free editable map of the world. The map data format used by OpenStreetMap is based on XML. OpenStreetMap uses a topological data structure, with four core elements (also known as data primitives), listed below; a minimal parsing sketch follows the list.

Nodes are points with a geographic position, stored as coordinates (pairs of a latitude and a longitude) according to WGS 84. Outside of their usage in ways, they are used to represent map features without a size, such as points of interest or mountain peaks.

Ways are ordered lists of nodes, representing a polyline, or possibly a polygon if they form a closed loop. They are used both for representing linear features such as streets and rivers, and areas, like forests, parks, parking areas and lakes.

Relations are ordered lists of nodes, ways and relations (together called "members"), where each member can optionally have a "role" (a string). Relations are used for representing the relationship of existing nodes and ways. Examples include turn restrictions on roads, routes that span several existing ways (for instance, a long-distance motorway), and areas with holes.

Tags are key-value pairs (both arbitrary strings). They are used to store metadata about the map objects (such as their type, their name and their physical properties). Tags are not free-standing, but are always attached to an object: to a node, a way or a relation. A recommended ontology of map features (the meaning of tags) is maintained on a wiki.
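
As a minimal sketch of the OSM XML element structure described above, the following parses nodes, ways, and tags from a tiny hand-written snippet, using only the Python standard library; element and attribute names follow the OSM schema.

```python
import xml.etree.ElementTree as ET

osm_xml = """
<osm version="0.6">
  <node id="1" lat="52.5170" lon="13.3889"/>
  <node id="2" lat="52.5171" lon="13.3891"/>
  <way id="10">
    <nd ref="1"/>
    <nd ref="2"/>
    <tag k="highway" v="residential"/>
    <tag k="name" v="Example Street"/>
  </way>
</osm>
"""

root = ET.fromstring(osm_xml)
nodes = {n.get('id'): (float(n.get('lat')), float(n.get('lon')))
         for n in root.iter('node')}
for way in root.iter('way'):
    refs = [nd.get('ref') for nd in way.iter('nd')]
    tags = {t.get('k'): t.get('v') for t in way.iter('tag')}
    print(tags.get('name'), [nodes[r] for r in refs])
```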

OpenDRIVE
OpenDRIVE (http://www.opendrive.org) is an open XML format to describe track-based road networks, including road geometry, lanes, road signs, signals, etc. The primary purpose of the format is to provide interchange capabilities between the 3D simulation environments commonly used in ADAS development.

3.6 Scenario Formats

Scenario formats are primarily used to exchange simulation scenarios in terms of scene content and scene layout as well as object movement and behavior.

OpenSCENARIO
OpenSCENARIO (http://openscenario.org) is an open XML format to describe dynamic content in driving simulation applications. Its main purpose is to help define and exchange scenarios between applications, as well as to provide a standardized, open toolset for the validation of scenario definitions. This format builds on top of OpenDRIVE.

3.7 Machine learning model formats

OpenCV XML/YAML
The OpenCV (Open Source Computer Vision) libraries provide functionality for real-time processing of computer vision tasks. OpenCV natively supports the storage of all internal data structures and primitive data types in textual, structured XML (http://www.w3c.org/XML) or YAML (http://www.yaml.org) file formats. OpenCV primarily uses these textual formats for the distribution of basic, pre-trained classification objects for human, face, or eye recognition based on boosted cascades.

Binary-extended OpenCV YAML
This is an extension of the OpenCV YAML data format that allows for binary data fields, thereby providing a more efficient storage structure.

Dlib object serialization
The Dlib open source toolkit for machine learning supports object serialization into a binary output stream, for saving an object’s state at any time. In addition, binary Google protocol buffer objects can also be read/written via Dlib’s serialization routines.

Viulib supported formats


Vicomtech’s Viulib libraries (www.viulib.org) extend the file storage formats provided by OpenCV (i.e. XML/YAML) to objects of higher visual processing tasks, such as descriptors, classifiers, trainers, detectors, etc. Moreover, Viulib also encapsulates Dlib’s serialization streams into the XML/YAML formats, by converting them into base64 strings whenever necessary. Currently, the human-readable and easily editable text formats are preferred, although OpenCV’s binary-extended YAML formats are also supported.

Caffe-compatible formats for deep neural network models
Caffe relies on Google Protocol Buffers (https://developers.google.com/protocol-buffers/) to define deep neural network models, for the following strengths: minimal-size binary strings when serialized, efficient serialization, a human-readable text format compatible with the binary version, and efficient interface implementations in multiple languages, most notably C++ and Python. This all contributes to the flexibility and extensibility of modeling in Caffe. Caffe-compatible deep neural network models, also called Nets, are defined in a plaintext protocol buffer schema ('.prototxt'), while the learned models are serialized as binary protocol buffer ('.caffemodel') files. More specifically, the Caffe-compatible model format is defined by the following protobuf schema: https://github.com/BVLC/caffe/blob/master/src/caffe/proto/caffe.proto. This source file is mostly self-explanatory, so one is encouraged to check it out. A Caffe-compatible Net can be defined in two different ways: for training and for deployment. The differences between one scheme and the other are mainly related to the way in which the inputs and outputs are defined (a minimal loading sketch is given at the end of this section). Further details: http://caffe.berkeleyvision.org/tutorial/. In addition, Caffe requires a further plaintext protocol buffer file ('.prototxt'), called the Solver, which orchestrates the model optimization by coordinating the network’s forward inference and backward gradients to form parameter updates that attempt to improve the loss. The responsibilities of learning are divided between the Solver, for overseeing the optimization and generating parameter updates, and the Net, for yielding loss and gradients. NVIDIA DIGITS provides a user-friendly GUI to handle the training parameters, which are then converted to a Solver file for Caffe.

Tensorflow
Tensorflow represents a machine learning model (e.g. a neural network) as a computational graph. The graph is serialized using Google Protocol Buffers. Tensorflow provides APIs in C++ and Python to save and load this computational graph.

PMML
PMML (http://dmg.org/) is a standard for statistical and data mining models.
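
To illustrate how the two Caffe files fit together, a minimal loading sketch using pycaffe follows; the file names are placeholders for an existing deployment schema and trained weights.

```python
import caffe

caffe.set_mode_cpu()
net = caffe.Net('deploy.prototxt',    # plaintext protobuf: architecture
                'weights.caffemodel', # binary protobuf: learned parameters
                caffe.TEST)           # deployment phase, no loss layers
# Inspect the layer outputs ("blobs") and their shapes.
print([(name, blob.data.shape) for name, blob in net.blobs.items()])
```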

3.8 Meta data formats

OpenCV YAML
YAML is a human-readable data serialization language that takes concepts from programming languages such as C, Perl, and Python, and ideas from XML and the data format of electronic mail (RFC 2822). The OpenCV YAML format is, for example, used to store camera calibration (intrinsic and extrinsic) information (a minimal sketch is given at the end of this section).

ASAM MCD-2 MC (aka ASAP2)
The ASAP2 format (https://wiki.asam.net) is an automotive-specific format used to describe the format of internal ECU variables and messages used in measurement and calibration. The standard is commonly used in automotive environments.

FIBEX
The "Fieldbus Exchange Format" FIBEX (http://www.asam.net) is an XML-based format used to describe automotive embedded networks as well as the data format for message-based bus communication systems. The definition information includes the network topology, configuration parameters, schedules, frames, and signals, as well as their coding at the bit level. FIBEX has become established as a standard for the FlexRay bus system.

AUTOSAR
The "AUTomotive Open System Architecture" AUTOSAR (https://www.autosar.org) is an association of worldwide stakeholders in the automotive sector. Its objective is to provide a set of interchange standards for the definition of automotive architectures, components, and message exchange on automotive networks.
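
As a minimal sketch of the OpenCV YAML calibration storage described above, assuming recent OpenCV Python bindings; the matrix values are illustrative.

```python
import cv2
import numpy as np

K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])  # illustrative camera matrix
dist = np.zeros((1, 5))          # illustrative distortion coefficients

# Write intrinsics to YAML via FileStorage.
fs = cv2.FileStorage('calib.yaml', cv2.FILE_STORAGE_WRITE)
fs.write('camera_matrix', K)
fs.write('distortion_coefficients', dist)
fs.release()

# Read them back.
fs = cv2.FileStorage('calib.yaml', cv2.FILE_STORAGE_READ)
K_loaded = fs.getNode('camera_matrix').mat()
fs.release()
print(K_loaded)
```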

4. Import and Export interfaces

Note M6 version: In this version of the document all import and export interfaces that are currently available or will become available are listed. Not all import and export interfaces will be used in the final Cloud-LSVA system. See Section 1.2 for more information on the different versions of this document.

The storage part (Data Stores and Object Stores in Figure 1) of the Cloud-LSVA system will support certain default cloud storage services, see Section 2.2. These services provide default import and export interfaces, currently including:

LDAP, WebDAV, FTP, SMB2, SMB3 (encryption), AFP, NFS, CalDAV, CardDAV.

Docker images

BT, FTP, HTTP, NZB, Thunder, FlashGet, QQDL, and eMule.

SVN, GIT

DLNA/UPnP

Web-based (Apache, Joomla, Drupal, PHP, etc.)

Extending the default services, the Cloud-LSVA system provides extended import and export services: the Upload Engine, the Mobile Data Services, and the Bulk Data Services (see Figure 1). The sole purpose of these extended services is to provide data access mechanisms other than what is provided by the cloud storage system; they do not store data themselves.

4.1 Upload Engine

ROS server
On the Cloud-LSVA data server, being part of the data storage system, ROS functionality is provided. This makes it possible to automatically and transparently import and export ROS message streams from and to the Data Stores and Object Stores. ROS is based on a publisher-subscriber architecture (a minimal sketch is given at the end of this subsection).

DDS
The Data Distribution Service for Real-Time Systems (DDS) is an Object Management Group (OMG) machine-to-machine middleware standard that aims to enable scalable, real-time, dependable, high-performance, and interoperable data exchanges between publishers and subscribers. DDS is designed to address the needs of applications like autonomous vehicles, financial trading, air traffic control, smart grid management, and other big data applications.

RTMAPS
RTMaps (installed on a local computer or in the cloud) can be used to access and process data in multiple ways. The full processing can be done in RTMaps with standard and custom components in a diagram (.rtd). RTMaps can also be used to read and stream data to another application (locally or remotely) thanks to a set of interfaces such as simple data streaming (TCP or UDP), RTSP video streaming, a DDS interface, or a ROS interface/bridge.
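
A minimal sketch of the publisher-subscriber mechanism referred to above, assuming a ROS1 environment with a running roscore; the topic name and payload are illustrative, not part of the actual Upload Engine API.

```python
import rospy
from std_msgs.msg import String

rospy.init_node('upload_demo')
pub = rospy.Publisher('/cloud_lsva/annotations', String, queue_size=10)
rate = rospy.Rate(1)  # 1 Hz
while not rospy.is_shutdown():
    # Any subscriber on this topic receives the message stream.
    pub.publish(String(data='{"category": "Static_2D_object"}'))
    rate.sleep()
```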

4.2 Mobile Data Services

Robot Operating System
Robot Operating System (ROS) is a collection of software frameworks for robot software development (see also robotics middleware), providing operating-system-like functionality on a heterogeneous computer cluster. ROS provides standard operating system services such as hardware abstraction, low-level device control, implementation of commonly used functionality, message passing between processes, and package management. Running sets of ROS-based processes are represented in a graph architecture, where processing takes place in nodes that may receive, post, and multiplex sensor, control, state, planning, actuator, and other messages. Despite the importance of reactivity and low latency in robot control, ROS itself is not a real-time OS, though it is possible to integrate ROS with real-time code.

Location-based-dynamic-objects framework
The Location-based-dynamic-objects (LBDO) framework is developed to support research on communication concepts for cooperative-navigation applications distributed over a heterogeneous system of cloud computers and embedded car computers that realize real-time data loops.

This framework is exclusively developed for open research collaborations. Current connected navigation systems are designed for transaction frequencies of 1 minute or more (e.g. TPEG) and do not take the crowd-sourcing part of the loop into consideration. This framework is designed to handle applications that rely on sub-second latencies between client and provider. An example is a green-wave application that needs an update of the traffic light signal status every second while a car is on the approach path for that traffic light. This prototype implementation is named the 'location-based-dynamic-object' concept, in short LBDO. An application can register itself for 'dynamic' data attribute updates associated with a real-world object. A transfer is triggered if the car enters the location where the dynamic data is relevant for the embedded application. In the crowd-sourcing case, a crowd-sourcing action is triggered if the car enters a mining location. A "location-based-dynamic-object" is defined by:

the location for which the dynamic data is relevant (e.g. approach-path to traffic jam)

the lifetime of the location-based object

its dynamic data attributes (e.g. traffic light color status)

The location can be specified in different formats, such as WGS84 and OpenLR, and it can be specified as a point, an area, or a graph. In the LBDO transport system, a channel concept is introduced that specifies a group of objects sharing 'similar' communication parameters in a specific application context. This way, the transport system is able to select a configuration for the communication system that is optimal for the applications subscribing to dynamic data updates of specific real-life objects. A minimal sketch of an LBDO description is given below.
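The following Python sketch captures the three defining properties of an LBDO listed above (location, lifetime, and dynamic data attributes); all field names are assumptions made for illustration, not the framework's actual schema.

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class LBDO:
    # identifier of the object (illustrative field name)
    object_id: str
    # location for which the dynamic data is relevant,
    # e.g. a WGS84 polygon given as (latitude, longitude) pairs
    location: List[Tuple[float, float]]
    # lifetime of the location-based object, in seconds
    lifetime_s: float
    # dynamic data attributes, e.g. the traffic light color status
    attributes: Dict[str, str] = field(default_factory=dict)

# Example: a traffic light whose signal status is updated every second
traffic_light = LBDO(
    object_id="tl-42",
    location=[(52.00, 5.10), (52.00, 5.11), (52.01, 5.11), (52.01, 5.10)],
    lifetime_s=3600.0,
    attributes={"traffic_light_state": "green"},
)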

Figure 4: LBDO framework overview.

To make sure that the LBDO descriptions used to trigger a data transfer upon entering an LBDO location are kept fresh in an efficient way, an incremental update mechanism based on tiles is introduced. The world is divided into tiles, and LBDO objects are linked to those tiles. Tiles can range from rectangles of a couple of square kilometers to hundreds of square kilometers in size, depending on the density of LBDO objects per square kilometer. If a car comes into the vicinity of a tile, the delta object descriptions linked to that tile are refreshed, as sketched below. The framework also implements a loader concept: a channel can implement a loader for re-formatting proprietary messages to the LBDO format at run time.
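One conceivable way to map a WGS84 position to a tile index and refresh the deltas for nearby tiles is sketched below; the Web-Mercator-style tiling scheme and the fetch_deltas service are assumptions made for illustration, and the framework's actual scheme may differ.

import math

def tile_for(lat: float, lon: float, zoom: int) -> tuple:
    # Map a WGS84 position to a Web-Mercator tile index; a lower zoom
    # level yields larger tiles, matching regions with few LBDO objects
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

# When the car comes into the vicinity of a tile, refresh the delta
# object descriptions linked to it (fetch_deltas is hypothetical):
# deltas = fetch_deltas(*tile_for(52.0, 5.1, zoom=12))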


4.3 Bulk Data Services

Binary large object
A Binary Large OBject (BLOB) is a collection of binary data stored as a single entity in a database or file management system. BLOBs are typically images, audio, or other multimedia objects, though sometimes binary executable code is stored as a BLOB. In the Cloud-LSVA system, we will use BLOB transfer services that physically ship hard disks to the cloud storage data center. There, the contents of the disks will be treated as BLOBs and copied directly to the Cloud-LSVA Data Stores, along the lines of the sketch below.
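Copying the contents of a shipped disk into the Data Stores could look like the following sketch, assuming an S3-compatible object store fronts the Cloud-LSVA Data Stores; the endpoint, bucket, and file names are hypothetical.

import boto3

# Hypothetical S3-compatible endpoint of the Cloud-LSVA object store
s3 = boto3.client("s3", endpoint_url="https://objectstore.example.org")

# Copy one recording from the shipped disk into the Data Stores as a BLOB
s3.upload_file(
    "/mnt/shipped_disk/drive_001.bag",  # local path on the ingest host
    "lsva-raw-data",                    # target bucket
    "recordings/drive_001.bag",         # object key
)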


5. Conclusion

[final version only]


Annexes

1. Annex A: JSON Examples

Image

{
  // type identifier for images
  "type": "image",
  // globally unique id of the image
  "id": string,
  // if applicable, globally unique id of the video from which this
  // image was extracted
  "videoId": string,
  // width of the image in pixels
  "width": number,
  // height of the image in pixels
  "height": number,
  // time of capturing this image
  "timestamp": datetime,
  // encoding format: jpg, png, etc.
  "encodingFormat": string,
  // the encoded image data in bytes
  "encodedData": array[byte]
}

Object

{
  // TRUE or FALSE to indicate whether or not the user has validated
  // the object
  "valid": boolean,
  // the category of the object
  "category": "Static_2D_object" | "Dynamic_2D_object" |
              "Static_3D_object" | "Dynamic_3D_object",
  // globally unique identifier of this object
  "id": string,
  // encodes the type of the object: "car", "truck", "motorcyclist",
  // "biker", "pedestrian", "traffic sign", "zebra cross", "curb",
  // "lane markings"
  "type": string,
  // sequence of frames referenced by begin frame and end frame
  "frameSeq": [number, number],
  // bounding box: two corner points spanning the rectangle (2D) or
  // cuboid (3D) that will contain the object
  "bbox": [[number, number], [number, number]] |                 // for 2D object
          [[number, number, number], [number, number, number]],  // for 3D object
  // additional geometry of the object
  "geometry": {
    "type": "Polygon",
    "coordinates":
      // for 2D object
      [ [number, number], [number, number], … [number, number] ]
      |
      // for 3D object
      [ [number, number, number], [number, number, number], …
        [number, number, number] ]
  }
}
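A concrete example of a 2D object annotation following the schema above, built and serialized with Python's json module; the identifier and coordinate values are made up, and the corner-point interpretation of bbox is an assumption.

import json

# Illustrative 2D pedestrian annotation (all values are made up)
annotation = {
    "valid": True,
    "category": "Dynamic_2D_object",
    "id": "obj-0001",
    "type": "pedestrian",
    "frameSeq": [120, 245],
    # assumed interpretation: [[x_min, y_min], [x_max, y_max]]
    "bbox": [[310, 190], [380, 340]],
    "geometry": {
        "type": "Polygon",
        "coordinates": [[310, 190], [380, 190], [380, 340], [310, 340]],
    },
}

print(json.dumps(annotation, indent=2))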

Event

{
  // TRUE or FALSE to indicate whether or not the user has validated
  // the event
  "valid": boolean,
  // globally unique identifier of this event
  "id": string,
  // describing the event
  "type": string,
  // sequence of frames referenced by begin and end keyframe
  "keyFrameId": [number, number],
  // sequence of frames referenced by begin frame and end frame
  "frameSeq": [number, number]
}

Context

{
  // TRUE or FALSE to indicate whether or not the user has validated
  // the context
  "valid": boolean,
  // if applicable, reference to the video
  "videoId": string,
  // globally unique identifier of this context
  "id": string,
  // describing the context
  "type": string,
  // sequence of frames referenced by begin and end keyframe
  "keyFrameId": [number, number]
}