
20/04/2018


H2020 Inducement Prize: Big Data technologies

H2020 Work Programme 2016-2017, Topic identifier: BigDataPrize-01-2017

Questions and Answers

Q1. When sending the source code, will the intellectual property of the software remain with the participant?

A1. The software's IPRs will remain with the contestant. For additional details please refer to Chapter 8.3 of the Prize's Rules of Contest: http://ec.europa.eu/research/participants/data/ref/h2020/other/prizes/contest_rules/h2020-prizes-rules-big-data_en.pdf

Q2. How do you guarantee the confidentiality of the jury and of the access to the source code in the

evaluation of our software?

A2. The experts that will be hired to support the EC in the evaluation of the prize will indeed be

contractually bound to non-disclosure and confidentiality for all proceedings and software artefacts

that will be evaluated.

Q3. Is it allowed to load models from an online location? What locations are allowed (GitHub seems like a natural location)? This would enable participants to get the most out of the training and validation data while keeping the submission below 1 MB.

A3. No, this is not allowed. The software, with the characteristics indicated in the Rules of Contest document, must be uploaded directly to the contest platform as a package of at most 1 MB.

Q4. Will the test data be used to pre-rank teams? It is clear that a different time period will be used

to assess the final ranking but it is unclear if there is a difference between the test data and the data

used to pre-rank teams.

A4. There are three different datasets overall: the starting kit (containing sample data for testing purposes), the contest platform data and the evaluation platform data (the final, actual test). Please consult the Rules of Contest for which dataset applies to which phase and purpose.


Q5. Is it allowed to include small trained models in the submission as long as the total size doesn't

exceed 1 MB?

A5. Anything that fits in a standalone submission of max 1 MB is allowed, as long as it is consistent with the Rules of Contest.

Q6. Will there be a designated folder to store models locally or is it up to the participants to store the

models on an online location?

A6. Models can be stored locally between prediction steps, within a single submission to the platform.

However, please note that each individual submission to the platform is standalone and independent

from previous ones, and no data can be retrieved from the platform.

Q7. Our approach when building production forecasting systems is, usually, to pre-build as much of

the model as possible in order to reduce calculation times. Is it possible to include a data file

containing data for a pre-built model within our submission?

A7. Submission files that comply with the Rules of Contest and fit in a standalone submission of max 1 MB are allowed. The working-software requirement implies that any data file used by the submission must be accessible to the submitted software without requiring any intervention on the part of the contest platform. Additionally, reliance on an auxiliary data file must be compatible with the requirement that the submission be intelligible to an expert software engineer; see the admissibility conditions related to the working software on p. 5 of the Rules of Contest.

Q8. Will the testing data be of similar size as the training/validation data? If not, can you please give

an estimate of the expected difference in file sizes?

A8. The starting kit data, meant for testing, is intended to be smaller – we can't release more details

at this point.

Q9. Is it possible to add PyTorch to the suite of available software (http://pytorch.org/)? It is

becoming the serious alternative to Tensorflow/Keras (http://www.fast.ai/2017/09/08/introducing-

pytorch-for-fastai/).

A9. No additional languages or libraries/frameworks will be added.


Q10. In the requirements only the following languages are listed: C, C++, Python 2, Python 3, Octave, Julia, R. Can I expect other languages to be added later?

A10. The list contained in the Rules of Contest document is final and we do not plan to add any others.

Q11. Should every adapt step prediction over a 60-minute interval be completely independent of

earlier time steps? The example code seems to indicate this is not the case. The

predictSpatioTemporal.py file contains a save model option. Can we use the output directory as the

cache directory between time steps and assume that this data will not be modified between time

steps?

A11. You may cache files in the output directory.
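For illustration, here is a minimal Python 3 sketch (not part of the Starting Kit) of caching a model in the output directory between adapt steps within a single run; the file name "model_cache.pkl" and the use of pickle are assumptions made for the example.

************************
# Minimal sketch: cache a model in the output directory between adapt steps
# within a single run. The file name "model_cache.pkl" and the use of pickle
# are illustrative assumptions, not part of the Starting Kit.
import os
import pickle

def load_or_init_model(output_dir, init_fn):
    cache_path = os.path.join(output_dir, "model_cache.pkl")
    if os.path.exists(cache_path):
        # A later adapt step of the same run: reuse the cached model.
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    # First step of this run: build the model from scratch.
    return init_fn()

def save_model(output_dir, model):
    with open(os.path.join(output_dir, "model_cache.pkl"), "wb") as f:
        pickle.dump(model, f)
************************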

Q12. If we provide only the source code, how does the code get compiled? Would you run a script

that we provide that compiles the code before running the contest? We are likely to be using C++.

A12. We run no scripts other than predict.sh. The safest approach is to include the compilation of your code in the submission itself, to be run when predict.sh first executes. Alternatively, you can include binaries, but then you should include clear instructions on how the compilation would be carried out inside the provided docker, during the evaluation phase of the contest, to reproduce the provided binaries. Note that the evaluation phase has no access to the Internet, so please ensure that all source code necessary to compile under platform evaluation conditions is included in the submission. The maximum size requirement must be respected.
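For illustration, a minimal Python 3 sketch of the compile-on-first-execution pattern described above, as it could be invoked from predict.sh; the source and binary paths ("src/predictor.cpp", "bin/predictor") and the availability of g++ inside the provided docker are assumptions made for the example.

************************
# Minimal sketch: compile bundled C++ sources the first time predict.sh runs,
# then reuse the resulting binary at later steps. The paths and the availability
# of g++ in the docker are assumptions.
import os
import subprocess

def ensure_binary(submission_dir):
    binary = os.path.join(submission_dir, "bin", "predictor")
    if not os.path.exists(binary):
        os.makedirs(os.path.dirname(binary), exist_ok=True)
        subprocess.check_call([
            "g++", "-O2", "-o", binary,
            os.path.join(submission_dir, "src", "predictor.cpp"),
        ])
    return binary
************************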

Q13. Is it right to assume that we provide a set of pre-compiled binaries as part of Section 2 of Part B

and that these pre-compiled binaries are used for the test or will the model have to be first compiled

on your instance?

A13. Technically, pre-compiled binaries can be included, but if they are not compiled on the machine, they are not intelligible to an expert software engineer and therefore do not fulfil the admissibility conditions related to the working software (p. 5 of the Rules of Contest).

Q14. On the test, am I right in thinking that initially within the data directory there will only be the

training data? After one step of the test you will place a file in the adapt directory and call our

program again. Our program can then use the extra information provided to make any adaptations

to the model before predicting the next time step. Your software will then iterate around placing

one file in the adapt directory and calling our software again until all time steps have been

completed?

A14. Yes. Note that at each adapt step not just one file of target data is placed, but also additional auxiliary data files synchronized to that step.


Q15. The starting kit mentions that the “Open Power Systems Data" has been converted to .h5

format and sliced to be in sync with the target data. I do however only see one folder in

starting-kit\auxiliaryData\train\_aux\opsd15 for the 289 files in

starting-kit\notebook\See4C_starting_kit\sample_data\train\Xm1. The adapt opsd data

starting-kit\auxiliaryData\adapt\_aux\opsd15 does contain the 69 folders one would expect given

the 69 files in starting-kit\notebook\See4C_starting_kit\sample_data\train\Xm1. Will the starting kit

be updated soon to include all opsd train data?

A15. No, the train target data was sliced for your convenience. The train target and auxiliary data are not synchronized by steps: you may use them as you wish. For example, you can merge the target .h5 data and re-slice it as you prefer, or slice the opsd data yourself, depending on how you choose to train on the data.
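For illustration, a minimal Python 3 sketch of merging the sliced train target files into one array, using the HDF5 dataset path X/value/X/value described in A61 below; the file naming and ordering are assumptions to verify against your copy of the Starting Kit.

************************
# Minimal sketch: merge sliced train target files (X0.h5, X1.h5, ...) into a
# single array, using the dataset path "X/value/X/value" described in A61.
# File naming/ordering are assumptions to check against your Starting Kit copy.
import glob
import os

import h5py
import numpy as np

def merge_train_targets(train_dir):
    files = sorted(glob.glob(os.path.join(train_dir, "X*.h5")),
                   key=lambda p: int(os.path.basename(p)[1:-3]))  # numeric order
    chunks = []
    for path in files:
        with h5py.File(path, "r") as f:
            chunks.append(np.array(f["X/value/X/value"]))
    # Concatenate along the time axis (first dimension).
    return np.concatenate(chunks, axis=0)
************************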

Q16. Will the test and verification tasks be the same as the one provided in the sample kit, namely the forecasting of flow-related measurements over a number of high-tension lines? We understand that the set of lines may be different in the test and validation steps from the set of lines provided in the starting kit. This would enable us to decide, for example, whether the use of the auxiliary data sets is useful for the model.

A16. Yes, same task type but different data and possibly parameters.

Q17. Can you indicate the size of the test data set so that we can make informed decisions about how best to use the Amazon instance? (For example, it would be useful to know whether the data provided can fit within a single GPU instance so that we can decide on the optimal processing pattern for the data.) In particular, what is the size of the historical dataset that will be provided on which to train the model?

A17. It is not up to the participants to allocate the AWS resources; they just upload the software to the platform. We allocate exactly one AWS "p2.8xlarge" instance for each run, and it is available for a maximum of 6 hours. The participant can decide to use only part of this allocation. See the Annex of the Rules of Contest.


Q18. Concerning the spatialization in the starting-kit: the two mapping matrices (44x44 and 64x64)

seem to map a 1d-array into a 2d 'raster'. Are the close points in the 2d map close also in terms of

measurements? Can I assume, here or in the real competition, that two adjacent cells in the resulting

44x44 matrix are adjacent also in the reality?

A18. No, there are no guarantees of closeness among cells in the sample code. Reordering the data to

increase forecast accuracy is part of the challenge.

Q19. Your ReadMe suggests that missing values will be coded as NaNs. However, both the viewer we are using (HD5View) and the library we are using to pull the data into our system seem to be reading missing values as zeros. For example, in X19.h5, line 171 (timestamp 1248265500) appears (based on the viewer and library that we are using) to be completely full of zeros (so clearly a missing value). Could you confirm whether that line does indeed have NaNs in it (and it is our ignorance that is causing the problem), or whether it is genuinely full of zeros?

A19. Nonstationarities can involve elements of time series that are zeros for substantial portions of the overall record; the sample data is only a snippet.

Q20. In the starting kit phase, I can recover by myself the function index-to-unixtime. In the contest

platform, will you give contestants the unixtime and not the index?

A20. We provide unixtime time stamps; how you index them in your submission code is up to you. Please pay attention to the README.txt file and the instructions about reading the horizon value first, so as not to make assumptions about its value.
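For illustration, a minimal Python 3 sketch of reading the horizon value from taskParameters.ini instead of assuming it; the key name "horizon" is an assumption, so check the actual .ini file and the README.txt instructions in the Starting Kit.

************************
# Minimal sketch: read the prediction horizon from taskParameters.ini instead
# of hard-coding it. The key name "horizon" is an assumption; check the actual
# .ini file and the README.txt instructions in the Starting Kit.
def read_horizon(ini_path, key="horizon"):
    # Deliberately simple parser, tolerant of a section-less .ini file:
    # scan for a "key = value" line and return the value as an integer.
    with open(ini_path) as f:
        for line in f:
            name, _, value = line.partition("=")
            if name.strip().lower() == key:
                return int(value.strip())
    raise ValueError("key %r not found in %s" % (key, ini_path))
************************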

Q21. Is it possible to officially add participants to an existing team?

A21. It is not possible to add participants after the registrations have been submitted and closed.

Q22 Are we allowed to provide our own Dockerfile which includes the necessary dependencies (e.g. the C++ HDF5 library)?

A22 Within the 1 MB limit, useful libraries such as interfaces can be included; check the provided docker in interactive mode for related binaries and libraries that are already present.

Q23 If providing a Dockerfile with the necessary dependencies is not allowed, can we download and compile the necessary dependencies? Our main concern is using the C++ HDF5 library for reading/writing data; the library itself can take more than 2 MB in compressed form, while the submission is limited to 1 MB.


A23 Please read A22 above. We will look into this issue; for now, either interface to the binaries already present or use a higher-level language to read/write/translate the data into a form you can more easily read.

Q24 Could you clarify the status of 'python cuda' and 'python opencl'? These packages would be extremely helpful, but they are commented out in the Dockerfile and referred to as "packages with problems". Any chance we can get them running?

A24 Please check the latest docker file functionality.

Q25 Are we restricted to the packages present in the Dockerfile or can we load additional ones (at

least if they are available in the main/universe repositories)?

A25 You must include any additional packages/code in the submission itself, as per the Rules of Contest.
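For illustration, a minimal Python 3 sketch of making a pure-Python package bundled inside the submission importable at run time; the "vendor" folder name is an assumption, and the 1 MB limit applies to anything bundled this way.

************************
# Minimal sketch: make a pure-Python package bundled inside the submission
# (here under an assumed "vendor/" folder) importable at run time. The 1 MB
# submission limit still applies to anything bundled this way.
import os
import sys

SUBMISSION_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.insert(0, os.path.join(SUBMISSION_DIR, "vendor"))

# After this, "import some_vendored_module" resolves against vendor/ first.
************************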

Q26 It is with some disappointment that we see the target data time series stripped of their geolocations. Thus, it all boils down to finding a needle in a haystack, i.e. a kind of big regression analysis. What is the idea behind withholding that much prior knowledge? It seems odd to call it a geospatio-temporal forecasting competition then.

A26 The target data is presumed to have complex topological and causal relationships that require extensive analysis, even if the geolocations are given. In a Big Data context, due to privacy and confidentiality issues, anonymisation of data is common. To encourage the development of novel methods within the constraints of confidentiality, the geospatial information is anonymised.

Q27 The auxiliary data contains useful spatial information, but the targets (the tension lines) are anonymised with respect to location, which is a strong and possibly unnecessary limiting factor for the spatio-temporal predictive model. We believe that every participant would be able to provide a more effective model if the location of the targets were made available. Is it possible for the next release of the data to include such information?

A27 Same as A26.

Q28 Could you provide more information about when the auxiliary data is available to the model? Is it safe to assume that, except in specific cases testing for robustness, at every generation step m all the auxiliary data up to the "prediction generation time" will be available?

A28 Yes, though delays between latest auxiliary data and prediction generation time may vary.


Q29 Could you confirm that every submission evaluation starts with training (i.e. m=0) and that the time spent during training is not counted in the OEET score? Is there an upper bound on the time the model can train on your platform before being used for the predictions?

A29 Please re-read the Rules of Contest. The 'O' in OEET stands for 'overall': training and prediction together must be within the time bound given. How your submission chooses to apportion that time is entirely up to you.

Q30 The original data contains the time of the sample in unix time (seconds since 1970), which is a convention that allows absolute time (dates with arbitrary accuracy) to be represented as numbers. We have reindexed this absolute time as a sequence of integers starting from 0, which corresponds to 1,246,492,800 seconds since 1970. The interval between two time indices is 300 seconds, e.g. a file containing 12 frames has time indices [0, 1, 2, ..., 11], corresponding to a duration of 11*300 = 3300 seconds. The files and contiguous sequences have a variable number of frames. Is the goal to predict 11 frames into the future? In the contest platform will you give contestants the unixtime and NOT the index?

A30 We provide unixtime time stamps; how you index them in your submission code is up to you. Please pay attention to the README.txt file and the instructions about checking the horizon value first.
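For illustration, the conversion implied by the constants quoted in Q30 (base 1,246,492,800 seconds since 1970 and 300-second spacing) can be written as a small Python 3 helper; verify these constants against the data you actually receive.

************************
# Conversion between the unixtime stamps provided on the platform and the
# 0-based integer index described in Q30 (base 1,246,492,800 s, 300 s spacing).
# These constants come from the question itself; verify them against the data.
BASE_UNIXTIME = 1246492800
STEP_SECONDS = 300

def unixtime_to_index(t):
    return (int(t) - BASE_UNIXTIME) // STEP_SECONDS

def index_to_unixtime(i):
    return BASE_UNIXTIME + int(i) * STEP_SECONDS
************************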

Q31 It seems that Tensorflow 1.5 cannot be used with CUDA 8, and that downgrading to Tensorflow 1.4 requires cuDNN 6, which is not installed. Has anyone else reported this problem?

A31 cuDNN v.5.1 for CUDA 8.0 is installed in the docker instance on the evaluation platform (and

therefore does not have to be included in the submission). This version is matched to the Tensorflow

1.0.1 version (seen in the latest Dockerfile).

Q32 We notice that the Dask library is available in your docker for Python 2 but not for Python 3. We are currently supposing that this is a mistake and that we can assume the Dask library will also be available in the Python 3 environment. Could you confirm our assumption? Can you add a "pip3 install dask" line to the Dockerfile in the next version of the starting kit?

A32 Dask is now available for both Python 2 and 3. No version updates are foreseen at this point. See

updated version of GETTING_STARTED.txt for help on how to add cuDNN to your docker instance on

your own equipment.


Q33 In the example script, we notice that train and adapt files are processed and cached to disk to speed up the following steps. Can you confirm that this is indeed allowed in the production environment? Can you tell us how much disk space is reserved for caching, and whether we can assume that this space is preserved from one processing step to another?

A33 The space available from one step to another depends on your submission (what it does) as well as machine memory limitations and initial cache settings. You should monitor such settings yourself (note that your code has no sudo privileges).
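For illustration, a minimal Python 3 sketch of checking free disk space in a cache directory before writing intermediate files, which requires no sudo privileges; the 100 MB safety margin is an arbitrary example value.

************************
# Minimal sketch: check free disk space in the cache directory before writing
# intermediate files; no sudo privileges are required. The 100 MB margin is an
# arbitrary example value.
import shutil

def has_room(cache_dir, needed_bytes, margin_bytes=100 * 1024 * 1024):
    usage = shutil.disk_usage(cache_dir)  # total/used/free, Python 3.3+
    return usage.free >= needed_bytes + margin_bytes
************************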

Q34 Can we initialize a program/thread at the first processing step and keep it running for a long period without having to restart it at every processing step? One typical example of such an always-running thread would be "in-memory caching". Is this allowed?

A34 We do not actively kill running processes initiated by your submission in between intermediary

steps. Each step begins (at least) one additional process. Whether you cache on disk (you may use the

same folder as predict.sh) or use inter-process communication, your code must monitor disk and RAM

usage (without sudo). An accumulation of active processes as steps progress is not recommended.

Q35 Looking at the sample data, we observe that some lines show a value equal to 0 for quite long periods, and we are wondering whether those should be considered missing data or not. The manual states that missing values should appear in the matrices as NaN, but we haven't seen any NaN in the sample data. So we wonder if the zeros are actually the NaNs.

A35 Missing target data (whether NaN or gaps in timestamps) is likely to appear in upcoming phases. Zero values over long periods are part of the challenge.

Q36 Can we assume that the adapt data of step i necessarily has a datetime earlier than the data of step j, with j > i? As you clearly explain that you want to emulate real-time conditions, we believe the assumption is correct, but we would like confirmation.

A36 Yes, for the target data it should be this way.

Q37 Can we assume that there are no missing values in the main input data? The manual clearly states that auxiliary data could be missing, but the text is less clear for the main data.

A37 Missing target data (whether NaN or gaps in timestamps significantly greater than 5 mins) is

likely to appear in upcoming phases.


Q38 What does the data represent? Do the values in the 'X' data represent power production/capacity/excess/transfer or something else?

A38 The data represents estimates of flow (interpretable as ‘transfer’).

Q39 As we understand it, the data consists of 3 dimensions: x/y coordinates and time. While the time steps seem to be defined, the geospatial scaling is not so clear. Do different datasets represent different locations? If so, is the spatial scale preserved across datasets, or is different spatial scaling used throughout?

A39 Each set has different geospatial scaling.

Q40 In the competition description it is written: "Datasets will consist of time series recording weather conditions and parameters of energy grid operations". Is such additional data given, or should it be considered?

A40 Please see the description of the auxiliary data.

Q41 What is the role of the adapt datasets? The introduction document says that there are several

phases and additional data will be provided for every phase. It is not clear what kind of data it will be

and how we should iterate through it. Can you please provide more details?

A41 Please re-read the documentation provided.

Q42 What is the copyright on the dataset, can it be shared freely, or reused for purposes other than

the competition (for example as an example in a course)?

A42 The starting kit is a compilation of material restricted to the purpose of evaluation in the context

of this competition. As some of the material cannot be restricted in such a manner (see

DISCLAIMER.txt in the Starting Kit), you may reuse such material according to its original license.

Q43 The auxiliary data shows a quite short span of GOSAT data (with 3 or 5 days it is literally

impossible to fit models, especially with slowly changing values such as atmospheric GHG column

measurements) and no OPSD 60 data. We understand that the auxiliary data in the test environment

might look completely different. But can we at least assume that measurements available during the

‘adapt’ steps are also available in the ‘train’ data set? If that were the case, we would for now

manually download further GOSAT and OPSD 60 values to test our models.


A43 You may manually download further data from any of the auxiliary data sources, of course.

While the train dataset is indeed representative of the adapt step data, please re-read Starting Kit

text files, especially about non-stationarity.

Q44 The OPSD60 data set comes with columns with forecast data, e.g. ‘DE_50hertz_wind_forecast’

or ‘DE_amprion_solar_forecast’. However, the values provided seem not to be the day-ahead values,

viz. a forecast for day X generated on day X-1 is only made available to our calculations on day X. In

real circumstances, values would already be provided a day ahead, e.g. 50 Hertz is computing their

values at 9:00 of the previous day (and even publishing them to the public at 18:00 of the previous

day). For a realistic scenario, it would be beneficial to provide these 6 time series “lagged” compared

to the others.

A44 The forecast sub-series within the OPSD60 dataset (of which there are more than 6) are potentially useful proxies for step-ahead prediction; deciding which lead ("day-ahead" or otherwise) they are most useful for is up to you. The forecast accuracy varies among these series. There is insufficient evidence, supported by the data itself and consistently applicable to each of the "forecast" sub-series, to correct the OPSD series compilation process (which is quite complex, covering multiple years) in the manner suggested.

Q45 Could you give us an order-of-magnitude estimate of the likely number of steps involved (1, 10, 100, 1000, etc.) so that we can decide whether the model is likely to run within six hours and is therefore worth submitting?

A45 In the current phase you can expect approximately 600 steps total along with a horizon of 12.

Q46 What is there to do about problems building the interactive notebook Docker container in the

Starting Kit using the updated evaluation Dockerfile as a base? Are there Xm1, Xm2, Xm3, Xm4

folders in the test platform as mentioned in the /notebook interactive scripts in the Starting Kit?

Where is the /adapt folder on the test platform and which data is used for pre-ranking?

A46 Carefully read the top-level README.txt in the top folder of the Starting Kit (a file whose text begins with "Starting Kit, H2020 Big Data Horizon Prize"), which provides guidance on questions about formatting and data. The interactive docker and related instructions under the /notebook subfolder are optional, included for extra convenience, and are provided "as-is" (see the readme texts inside the /notebook folder for disclaimers).

There are no Xm2, Xm3 and Xm4 folders currently foreseen. Were these sub-folders present, they would signify training data rather than adapt data (and their presence would be consistent with the top-level README.txt instructions). Please refer to the top-level README.txt (especially lines 68-118) for guidance on data formatting (again, not the /notebook folder files), as well as to the Rules of Contest for questions about pre-ranking.


The top-level README.txt specifies how the training data is to be found; please pay particular attention to lines 92 to 89. For the auxiliary data, the containing folder X* under /train/aux/<sourcea> can have any value for *. For your convenience, a containing folder "Xm1" has multiple records, while "X0" has a single record. At any rate, the contents of /train are all available at the execution of the initial step and may be used. On the other hand, /adapt/aux files are presented incrementally, as are the target data files.

Q47 Is it required to use predictSpatioTemporal.py as in the starting kit? Is it required to use Python?

A47 Submissions don’t have to use Python at all (as per Rules of Contest). You must respect the syntax

of predict.sh, but that script can call anything else that the docker and your submission support. If you

choose Python as your main scripting language, you certainly don’t have to use

predictSpatioTemporal.py (that is sample code only, you may ignore it). On a related note, if you use

python3 in combination with the sample code classes, exercise great care with compatibility, paths,

class names, etc.

Q48 Will auxiliary data have the same basic data sources in the test platform as in the starting kit? Is

the auxiliary data presented with the same frequency as in the starting kit? Will missing values only

be gaps in provided data? If not, how will missing values be encoded ('NaN', 'NA', ... ?) ?

A48 The main determinant of the frequency (and time distribution) of auxiliary data is its original availability: you may expect all sources listed so far, in both train and adapt, except opsd30 and NOAA_HYCOM, whereas the remaining NOAA source has a relevant end-of-availability date. See the top-level README.txt for further details under the heading 'Auxiliary Data'. Whether or not any of the auxiliary sources are 'worth' processing is a decision that your submission may make as it executes (please re-read the top-level README.txt, the paragraph beginning at line 44). In the same README.txt, missing values are specified as NaNs (line 23); it does not say that only gaps (or all gaps) are provided as NaNs, so exercise caution.

Q49 How many files in the adapt folder do you mean by 600 steps with a horizon of 12? Where is

taskParameters.ini found during test platform runs relative to /train and /adapt? Does only the adapt

data count towards an RMSE score? How can we know how a submission’s RMSE compares to the

persistence model?

A49 A step, in the Rules of Contest, requires prediction of a certain horizon; therefore 600 steps implies 600 files. The top-level README.txt in the Starting Kit is clear on the fact that you should read the .ini file value, and on where the file is found (see line 85). Your script predict.sh (at the top level of the zip, as in the sample submission in the Starting Kit) should respond to being called and output a file of the required horizon. The top-level README.txt in the Starting Kit (as well as the Rules of Contest) explains that, indeed, the adapt data only (and not train) forms the basis of the RMSE score. Your submission may calculate the persistence RMSE score as it executes at each step and (possibly) react based on this comparison.
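For illustration, a minimal Python 3 sketch of the persistence baseline RMSE at an adapt step (repeat the last observed frame over the horizon and compare with the newly revealed target frames); the shapes follow the HORIZON x NLINES convention of A61, and this is not the platform's official scoring code.

************************
# Minimal sketch: persistence baseline RMSE at an adapt step, i.e. repeat the
# last observed frame over the horizon and compare with the target frames
# revealed at the next step. Shapes follow the HORIZON x NLINES convention of
# A61; this is not the platform's official scoring code.
import numpy as np

def persistence_rmse(last_observed_frame, revealed_targets):
    # last_observed_frame: shape (NLINES,)
    # revealed_targets:    shape (HORIZON, NLINES)
    prediction = np.tile(last_observed_frame, (revealed_targets.shape[0], 1))
    return float(np.sqrt(np.mean((prediction - revealed_targets) ** 2)))
************************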


Q50 What is the role of the adapt datasets? The introduction document says that there are several

phases and additional data will be provided for every phase. It is not clear what kind of data it will be

and how we should iterate through it. Can you please provide more details?

A50 The adapt data can be thought of as incrementally presented "test" data. Phases signify pre-ranking and final validation (as in the Rules of Contest). 'Adapt' does not refer to a separate dataset but to the portion of the data that is assigned as "test" data for each phase. Whereas "training" data is provided at the beginning of the evaluation of a submission, "adapt" data (both target and auxiliary) is provided in incremental steps. Read the top-level README.txt in the Starting Kit under the heading "How submissions are expected to access data:"; inspecting the sample persistence submission in the Starting Kit is also helpful, as is this FAQ.

Q51 What common errors have submissions so far exhibited in their data access?

A51 Directory and file paths: do not hard code data paths. Refer to the file predict.sh under

/notebook/sample_submission/persistence/submission_sample_persistence.zip which describes the

call syntax of the entry point to your own 'predict.sh' script. Among its arguments are top paths of

input and output data. Do not assume trailing '/' in the paths given as bash arguments. The top-level

README.txt file in the Starting-Kit describes the subdirectory structure of the input data on the

platform (it is not the same as the directory structure of the Starting Kit itself).

Here is a compact description of the sub-paths relative to the input data folder as they actually are.

************************

taskParameters.ini

/adapt/X*.h5 (* depends on step, starts with X0.h5 then X1.h5 etc.)

/adapt/aux/opsd15/X*/X*.h5 (first * depends on step, second * are different at each step)

/adapt/aux/opsd60/X*/X*.h5 (first * depends on step, second * are different at each step)

/adapt/aux/NOAA_NCOM_Region2/X*/X*.nc.tar.gz (first * depends on step, second * are different

at each step)

/adapt/aux/gosat_FTS_C0*S_2/X*/X*.h5 (the first * is 1,2 or 3, second * depends on step, third * are

different at each step)

/train/Xm1/X0.h5

/train/aux/opsd15/Xm1/X*.h5

/train/aux/opsd60/Xm1/X*.h5

/train/aux/NOAA_NCOM_Region2/X0/X*.tar.gz

/train/aux/gosat_FTS_C0*S_2/X0/X*.h5

************************


Hard-coding paths based on Starting Kit examples may lead to errors. Specifically, this line contains a hard-coded path that throws an error: h5py.File(self.data_dir + '/train/Xm1/X1.h5','r')

Do not remove or alter the final line 'makeready.sh $3/Y$1'. Do not use default paths mentioned in

comments (those are suggestions for off-platform testing). Please note that you may not overwrite

output files once makeready.sh is called.

CLARIFICATION: as seen in /sample_data and /sample_submission directories we do not use .hdf5 or

.hdf as file extensions for HDF5 files but rather we use, and you should use: .h5
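For illustration, a minimal Python 3 sketch of path handling inside the code called by predict.sh; the argument order (step index, input data directory, output directory) is an assumption to verify against the sample predict.sh, and os.path.join is used so that a missing trailing '/' in the bash arguments does not break the paths.

************************
# Minimal sketch of path handling in the code called by predict.sh. The
# argument order (step index, input data dir, output dir) is an assumption;
# check the sample predict.sh in the Starting Kit for the actual call syntax.
# os.path.join tolerates a missing trailing '/' in the bash arguments, and no
# Starting Kit path is hard-coded.
import os
import sys

def build_paths(step, data_dir, output_dir):
    adapt_file = os.path.join(data_dir, "adapt", "X%d.h5" % step)
    train_dir = os.path.join(data_dir, "train")
    output_file = os.path.join(output_dir, "Y%d.h5" % step)
    return adapt_file, train_dir, output_file

if __name__ == "__main__" and len(sys.argv) >= 4:
    paths = build_paths(int(sys.argv[1]), sys.argv[2], sys.argv[3])
    # ... load data, predict, write the output file; predict.sh then calls makeready.sh
************************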

Q52a What common errors have submissions so far exhibited in their code path management?

A52a Although a docker script has been released, inadvertent code path errors can still occur (e.g. libraries and files that are present on your own setup but neither contained in the submission zip file nor on the platform). You must ensure that all the needed code is contained within your zip, apart from the libraries shown in the top-level Dockerfile in the Starting Kit and the cuDNN library (see FAQ and Starting Kit for details, and note the date of the docker build). No sample code (e.g. .py files from the Starting Kit) is present on the platform: only makeready.sh can be assumed to be provided.

Q52b What common errors have submissions exhibited in their version compatibility management?

A52b Please check version compatibility of software. For example, when using sample code from the Starting Kit written for Python 2 while intending to run your scripts under Python 3, you must make sure all code that is called is checked and upgraded for Python 3 compatibility.

GPU users: particular attention needs to be paid to Tensorflow versions (see the latest Starting Kit and Dockerfile), as well as to flag settings for Keras and other related GPGPU software: you must make your code compatible with the versions present.

Q53 Where is the public_data mentioned in a Readme file in the starting kit: "DATA — Directory

“sample_data” contains a mini-dataset for debug purposes. Replace sample_data by public_data, if

you want a larger dataset”?

A53 There is no other dataset provided for off-platform evaluation than the one found in the Starting

Kit under folder /sample_data. Sample auxiliary data is found under folder /auxiliaryData. See top-

level README.txt file in the Starting Kit for instructions on data directory structure on platform.

Additionally the file predict.sh under

/notebook/sample_submission/persistence/submission_sample_persistence.zip describes the call

syntax of the entry point to your code; among its arguments are top paths of input and output data.

Otherwise, regarding the "larger dataset" but also in general, the interactive docker and related

instructions under the /notebook/See4C_starting_kit subfolder are optional, for extra convenience


and are provided “as-is”. The phrase means that you can augment the sample data in the starting kit

from public sources (or emulate it) for practice purposes.

Q54 Can we expect NOAA NCOM data for both train and test at a majority of the steps on the test

platform?

A54 The NOAA NCOM source coverage duration overlaps most of the adapt and train periods of the

upcoming phases on the platform. By 'adapt period duration' we mean the time range between the

last train target data sample and last sample in the last adapt step of a phase.

Q55 Will most observations still have 5-minute gaps and will the missing data be missing for a

multiple of 5 minutes?

A55 Most Inter-Sample Intervals (ISIs) in the target data are 5 minutes, but not all ISIs can be

assumed to be multiples of 5 minutes. That said, keep in mind (citing from top-level README.txt in

Starting Kit): "the forecasts are always to be given at 5 minute increments."

Q56 Can the predict.sh script included in a submission be placed at a location other than the root

folder of the submission itself?

A56 No, it cannot. As mentioned in A49, it must be located at the root of the submission folder. Please make sure not to introduce additional folder levels in the process of compressing your submission into a .zip file. Some operating systems may automatically add a folder structure to the .zip file, so please double-check your .zip file before submitting/uploading.

Q57 In some of the distributed materials predict.sh is called with step argument numbers 1 to M (see the rules and the starting kit GETTING-STARTED.txt). In others (especially the sample submission code) step 0 is explicitly mentioned. Also, the sample adapt files start from 0 as well. Could you please clarify how steps are indexed?

A57 predict.sh is given the step index as an argument. The platform loop that increments through the steps m is 0-indexed (it starts at 0).

Q58 The starting kit contains examples of outputs formatted according to two different specifications. Could you specify which one should be followed so as to produce a well-formed output that the platform will be able to score?

A58 For format information, see the more comprehensive discussion in A61 below. Any code in the starting kit that does not produce a 2D matrix representation for the X component of the output (e.g. notebook/See4C_starting_kit/sample_code) should NOT be followed when formatting a submission output. Following that code would create an output file that the platform will reject as an output formatting error.

Q59 How should one manage submissions that try to use the Python package management system pip, which appears to be incompatible with certain versions of the python3 runtime?

A59 We have patched the python3 environment of the testing platform in order to fix this incompatibility. However, if you are using pip to import libraries that are neither available on the platform nor part of your submission, you are in violation of the Rules of Contest and your submission will fail, because the testing platform has disabled access to the internet.

Q60 How does the platform manage the stdout and stderr streams in order to generate the 20 lines of log that are offered as feedback?

A60 predict.sh is executed in such a way that the truncated log combines stdout and stderr, and this behaviour is not going to be changed. Attempts to circumvent the 20-line limit by printing (to either stream) lines of abnormal length will be interpreted as fraudulent and dealt with as per the Rules of Contest; see sections 8.5 and 8.10 in particular. The same applies to any attempt to write (to either stream) information gleaned from the platform and not made explicitly available to all contestants as part of the organization of the prize.

Q61 Could you provide more information on the format of input data files on the platform and the required format of the output files?

A61 File formats of target data

Glossary:

"h5 - numeric": any of the HDF5 data types supported for interoperability (integer as well as floating point). See https://www.loc.gov/preservation/digital/formats/fdd/fdd000229.shtml for details.

"h5 - numeric (float)": we mean only the float types among the "h5 - numeric" types, e.g. 64-bit floating point, 32-bit floating point.

"h5 - numeric (integer)": we mean only the integer types among these, e.g. 32-bit integer.

"matrix (h5 - numeric)": content of an HDF5 'type' field designating a matrix of the paired data type of the 'value' field, e.g. 'float matrix', 'int32 matrix'.

"matrix (h5 - numeric (float))": content of an HDF5 'type' field designating a matrix of the paired data type of the 'value' field where that data type is h5 - numeric (float), e.g. 'float matrix'.

<xyz> means not literally 'xyz' but that 'xyz' is merely a description of <xyz>.

('xyz') means xyz, literally.

Input target data file formatting

The input target data have an HDF5 structure of types/values that is hierarchical and meant to be

able to represent spatio-temporal data in general. The output data file format (see below) is meant

to be a simplified, non-hierarchical format similar to the input file format but containing only the

variable of interest in the challenge task (a numerical matrix). By input target data we mean “RTE”

.h5 files, not auxiliary data (the formats of which can be determined from Starting Kit auxiliary data).

Input target data, both /train and /adapt, both on the platform and the starting kit, has the following

structure:

X*.h5
|-- X
    |-- type = ('scalar struct')
    |-- value
        |-- X
        |   |-- type <matrix (h5 - numeric (float))>
        |   |-- value <see "X*.h5/X/value/X/value" below>
        |-- t
        |   |-- type <matrix (h5 - numeric)>
        |   |-- value <see "X*.h5/X/value/t/value" below>
        |-- aux <not used>
        |-- hat <not used>
        |-- metadata <not used>

Numerical contents of interest:

X*.h5/X/value/X/value
  starting kit: size <number of time steps> x <number of lines> x 1, variable type <h5 - numeric (float)>
  platform: size <number of time steps> x <number of lines>, variable type <h5 - numeric (float)>

X*.h5/X/value/t/value
  starting kit: size <number of time steps> x 1, variable type <h5 - numeric>
  platform: size <number of time steps> x 1, variable type <h5 - numeric>

Output file formatting

Each output file "Y*.h5", where * is 0,1,2,3.. as required (see predict.sh in sample .zip submissions in

Starting Kit) is an HDF5 file with one mandatory field named 'X' at the top level.

X should be a strictly 2 dimensional numerical floating-point matrix, i.e. have the following structure.

Y*.h5
|-- X
|   |-- type <matrix (h5 - numeric (float))>
|   |-- value <size <HORIZON> x <NLINES>, variable type <h5 - numeric (float)>>
|-- t <values are not used, but the field must be present>

The .zip sample submissions in the starting kit output a simplified HDF5 structure which is acceptable (and recommended), without (/type, /value) pair encoding of matrices, namely:

Y*.h5
|-- X <size <HORIZON> x <NLINES>, variable type <h5 - numeric (float)>>
|-- t <size <HORIZON>, variable type <h5 - numeric>>

Note: /value and /type are written by hdf5 libraries and should not be written explicitly by your code.


Numerical contents of interest:

Y*.h5/X/value OR Y*.h5/X

The number of rows in X must be equal to "HORIZON" and number of columns (NLINES) equal to the

number of lines present in the target data input files X*.h5. It must contain no NaN values.

All other fields except X and t will be ignored.

Y*.h5/t/value OR Y*.h5/t

While the values of t are not used (as they are already implicit from the prediction task), t must

contain either an array with exact numerical values <1, 2, 3,.. HORIZON> (recommended), as in the

.zip sample submissions in the starting kit, or a valid null value as seen in starting kit input data in

fields such as X*.h5/X/value/aux/value (caution!).

Additional notes:

1) To save space, on the platform the h5 - numeric variable type of X*.h5/X/value/t/value is a 32-bit integer.

2) Upon reading input files, it is recommended to cast the data contained therein to your desired working variable type before using these values.
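For illustration, a minimal Python 3 sketch (using h5py) that combines the input and output formats described above: read the platform-format target file, cast to a working float type, and write the recommended simplified output structure; the prediction shown is a persistence placeholder and the file names are examples only.

************************
# Minimal sketch (h5py): read a platform-format target file, cast to a working
# float type, and write the recommended simplified output structure (2-D X of
# shape HORIZON x NLINES, t = 1..HORIZON, no NaN values). File names below are
# placeholders; the prediction shown is a persistence placeholder only.
import h5py
import numpy as np

def read_target(path):
    with h5py.File(path, "r") as f:
        x = np.array(f["X/value/X/value"], dtype=np.float64)
        t = np.array(f["X/value/t/value"]).astype(np.int64).ravel()
    if x.ndim == 3:  # starting-kit files carry a trailing singleton dimension
        x = x[:, :, 0]
    return x, t

def write_output(path, prediction):
    # prediction: float matrix of shape (HORIZON, NLINES), must contain no NaNs
    assert not np.isnan(prediction).any()
    horizon = prediction.shape[0]
    with h5py.File(path, "w") as f:
        f.create_dataset("X", data=prediction.astype(np.float64))
        f.create_dataset("t", data=np.arange(1, horizon + 1, dtype=np.float64))

# Example usage (file names and horizon are placeholders):
# x, t = read_target("X0.h5")
# write_output("Y0.h5", np.tile(x[-1], (12, 1)))  # persistence over horizon 12
************************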

Q62 Can I use the library matplotlib?

A62 The library "matplotlib" cannot be used, as it has a dependency on the _tkinter library, which is not present on the platform.

Q63 GETTING-STARTED.TXT in the starting kit does not refer to /base/Dockerfile at all, so do we just ignore it?

A63 Yes, the base/Dockerfile can be ignored; its use would be limited to testing basic Docker build functionality on your system.

Q64 My attempt to run my software on the platform failed. Can I have a new attempt?

A64 No. Any attempt to run your software on the platform counts, whether it ends successfully or in an error, timeout, etc. Please do not attempt to exceed the three attempts on the platform. Exceeding the three attempts can be considered a breach of the Rules of the Contest (section 7.1) and lead to exclusion from the contest.


Q65 Can we assume that your environment is set up with nvidia-docker, as described in GETTING-STARTED.txt, so that a GPU environment is accessible to competitors?

A65 Please see the answers to questions A17 and A52b, and section 9 of the Rules of Contest.

Q66 Is there any possibility that multiple submissions might run at the same time, sharing the hardware?

A66 See answer A17. The platform allocates an entire p2.8xlarge instance for up to 6 hours to each attempt and does not share that instance with other participants while it is running.

Q67 Following a successful attempt on the platform I see a performance score displayed. Can you tell me what the RMSE of the persistence model on the platform data is, and the OEET time elapsed?

A67 The performance score (RMSE) is intended for you to improve your software before submitting it in the Participant Portal. The fact that your attempt was successful means that the total execution time is within the 6-hour limit (as you would otherwise have encountered a timeout error). We do not intend to release more information on the successful run. See section 7.2 of the Rules of Contest, and footnote 9 on page 7.

Q68 If we get a timeout error, does that ensure that our code has no other errors?

A68 Please check again the input paths in the README/FAQ as well as the output file formatting instructions (using the behaviour of the sample *.zip submissions in the starting kit as an example). To better leverage the feedback received upon errors, it is recommended that your code have clear error-producing exits at critical points in the execution, e.g. loading of training data, training on that data, making a forecast, etc. If your code does not exit upon execution error(s), timeouts may result.
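For illustration, a minimal Python 3 sketch of wrapping each critical stage so that a failure exits immediately with a short, distinguishable message (useful within the 20-line truncated log) rather than hanging until the timeout; the stage functions named in the usage comments are placeholders.

************************
# Minimal sketch: wrap each critical stage so a failure exits immediately with
# a short, distinguishable message (the platform log is truncated to 20 lines)
# instead of hanging until the timeout. Stage functions in the usage comments
# are placeholders.
import sys
import traceback

def run_stage(name, fn, *args):
    try:
        return fn(*args)
    except Exception:
        print("FAILED at stage: %s" % name)
        traceback.print_exc(limit=1)
        sys.exit(1)

# Example usage (placeholders):
# data  = run_stage("load training data", load_training_data, data_dir)
# model = run_stage("train model", train_model, data)
# pred  = run_stage("forecast", make_forecast, model, horizon)
************************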