20/04/2018
H2020 Inducement Prize: Big Data technologies
H2020 Work Programme 2016-2017, Topic identifier: BigDataPrize-01-2017
Questions and Answers
Q1. When sending the source code, will the intellectual property of the software remain with the
participant?
A1. The software's IPRs will remain with the contestant. For additional details please refer to Chapter
8.3 of the Prize's Rules of Contest
http://ec.europa.eu/research/participants/data/ref/h2020/other/prizes/contest_rules/h2020-prizes-
rules-big-data_en.pdf
Q2. How do you guarantee the confidentiality of the jury and of the access to the source code in the
evaluation of our software?
A2. The experts that will be hired to support the EC in the evaluation of the prize will indeed be
contractually bound to non-disclosure and confidentiality for all proceedings and software artefacts
that will be evaluated.
Q3. Is it allowed to load models from an online location? What locations are allowed (GitHub seems
like a natural location)? This would enable participants to get the most out of the training and
validation data while keeping the submission below 1 MB.
A3. No. This is not allowed. The software, with the characteristics indicated in the Rules of Contest
document, shall be uploaded directly on the contest platform as a package of maximum 1MB.
Q4. Will the test data be used to pre-rank teams? It is clear that a different time period will be used
to assess the final ranking but it is unclear if there is a difference between the test data and the data
used to pre-rank teams.
A4. There are three different datasets overall: starting kit (containing sample data for testing
purposes), contest platform data, evaluation platform data (which is the final actual test), please
consult the Rules of Contest again for which dataset applies to what phase/purpose.
Q5. Is it allowed to include small trained models in the submission as long as the total size doesn't
exceed 1 MB?
A5. Anything that fits in a standalone submission of max 1 MB is allowed as long as it is consistent with
the Rules of Contest.
Q6. Will there be a designated folder to store models locally or is it up to the participants to store the
models on an online location?
A6. Models can be stored locally between prediction steps, within a single submission to the platform.
However, please note that each individual submission to the platform is standalone and independent
from previous ones, and no data can be retrieved from the platform.
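A minimal sketch of this pattern in Python, assuming a pickle-based cache; the cache directory and the model.pkl file name are illustrative choices, not names mandated by the platform. Since each submission is standalone, such a cache only persists between the steps of a single run:

```python
import os
import pickle

def load_or_init_model(cache_dir, init_fn):
    """Load a model cached during an earlier prediction step of the
    same run, or build a fresh one via init_fn.  The 'model.pkl' file
    name is an illustrative assumption, not a platform convention."""
    path = os.path.join(cache_dir, "model.pkl")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return init_fn()

def save_model(cache_dir, model):
    """Persist the model so the next step of this run can reuse it."""
    os.makedirs(cache_dir, exist_ok=True)
    with open(os.path.join(cache_dir, "model.pkl"), "wb") as f:
        pickle.dump(model, f)
```
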
Q7. Our approach when building production forecasting systems is, usually, to pre-build as much of
the model as possible in order to reduce calculation times. Is it possible to include a data file
containing data for a pre-built model within our submission?
A7. Submission files that comply with the Rules of Contest and fit in a standalone submission of max 1
MB are allowed. The 'working software' requirement implies that any data file used by the submission must be
accessible to the submitted software without requiring any intervention on the part of the contest
platform. Additionally, reliance on an auxiliary data file should comply with the requirement that the
submission be intelligible to an expert software engineer, see admissibility conditions related to the
working software on p. 5 of the Rules of Contest.
Q8. Will the testing data be of a similar size to the training/validation data? If not, can you please give
an estimate of the expected difference in file sizes?
A8. The starting kit data, meant for testing, is intended to be smaller – we can't release more details
at this point.
Q9. Is it possible to add PyTorch to the suite of available software (http://pytorch.org/)? It is
becoming the serious alternative to Tensorflow/Keras (http://www.fast.ai/2017/09/08/introducing-
pytorch-for-fastai/).
A9. No additional languages or libraries/frameworks will be added.
Q10. In the requirements only the following languages are listed: C, C++, Python 2, Python 3, Octave,
Julia, R. Can I expect other languages to be added later?
A10. The list contained in the Rules of Contest document is final and we do not plan to add any others.
Q11. Should every adapt step prediction over a 60-minute interval be completely independent of
earlier time steps? The example code seems to indicate this is not the case. The
predictSpatioTemporal.py file contains a save model option. Can we use the output directory as the
cache directory between time steps and assume that this data will not be modified between time
steps?
A11. You may cache files in the output directory.
Q12. If we provide only the source code, how does the code get compiled? Would you run a script
that we provide that compiles the code before running the contest? We are likely to be using C++.
A12. We run no scripts outside of predict.sh. The safest approach is to include the compilation of your
code in the submission itself, to be run when predict.sh first executes. Alternatively, you can include
binaries, but then you should include clear instructions on how the compilation process would be
carried out inside the provided docker at the evaluation phase of the contest, so as to arrive at the
provided binaries reproducibly. Note that the evaluation phase has no access to the Internet, so please
ensure that all source code necessary to compile under platform evaluation conditions is included in
the submission. The maximum size requirement must be respected.
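One way to arrange compile-on-first-execution, sketched in Python; the wrapper name, its arguments, and the g++ command in the usage note are illustrative assumptions, not part of the contest interface:

```python
import os
import subprocess

def ensure_built(binary_path, build_cmd):
    """Run build_cmd once to produce binary_path, then reuse it.

    Called at the start of every step, this compiles only on the first
    invocation; later steps find the binary already present and skip
    the build.  Names and arguments here are illustrative only.
    """
    if not os.path.exists(binary_path):
        subprocess.run(build_cmd, check=True)
    return binary_path
```

For example, predict.sh could invoke such a wrapper before running the binary, e.g. with `build_cmd = ["g++", "-O2", "-o", "mybin", "main.cpp"]` (hypothetical file names).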
Q13. Is it right to assume that we provide a set of pre-compiled binaries as part of Section 2 of Part B
and that these pre-compiled binaries are used for the test or will the model have to be first compiled
on your instance?
A13. Technically, pre-compiled binaries can be included, but if they are not compiled on the machine,
they are not intelligible to an expert software engineer and therefore do not fulfil the admissibility
conditions related to the working software (p. 5 of the Rules of Contest).
Q14. On the test, am I right in thinking that initially within the data directory there will only be the
training data? After one step of the test you will place a file in the adapt directory and call our
program again. Our program can then use the extra information provided to make any adaptations
to the model before predicting the next time step. Your software will then iterate around placing
one file in the adapt directory and calling our software again until all time steps have been
completed?
A14. Yes; at each adapt step there is not just one file of target data but also additional auxiliary data
files synchronized to that step.
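A Python sketch of the per-step bookkeeping a submission might use to pick up files that have newly appeared in the adapt directory; function and variable names are illustrative, and note that the `seen` set would itself need to be persisted (or held by a long-lived process, see A34) between predict.sh invocations:

```python
import os

def new_adapt_files(adapt_dir, seen):
    """Return files that appeared in adapt_dir since the set `seen`
    was recorded, and update `seen` in place.  This is a sketch of
    per-step bookkeeping, not contest-mandated logic."""
    current = set(os.listdir(adapt_dir))
    fresh = sorted(current - seen)
    seen |= current
    return fresh
```
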
Q15. The starting kit mentions that the “Open Power Systems Data" has been converted to .h5
format and sliced to be in sync with the target data. I do however only see one folder in
starting-kit\auxiliaryData\train_aux\opsd15 for the 289 files in
starting-kit\notebook\See4C_starting_kit\sample_data\train\Xm1. The adapt opsd data
starting-kit\auxiliaryData\adapt_aux\opsd15 does contain the 69 folders one would expect given
the 69 files in starting-kit\notebook\See4C_starting_kit\sample_data\train\Xm1. Will the starting kit
be updated soon to include all opsd train data?
A15. No, the train target data was sliced for your convenience. The train target and auxiliary data are
not synchronized by steps: you may use them as you wish. For example, you can merge the target .h5
data, re-slice it as you wish, or slice the opsd data yourself, depending on how you choose to train on
the data.
Q16. Will the test and verification tasks be the same as the one provided in the sample kit, namely
the forecasting of flow-related measurements over a number of high-tension lines? We understand
that the set of lines may be different in the test and validation steps from the set of lines provided in
the starting kit. This will enable us to decide, for example, whether the use of the auxiliary data sets
is useful for the model.
A16. Yes, same task type but different data and possibly parameters.
Q17. Can you indicate the size of the test data set so that we can make informed decisions about
how best to use the amazon instance. (For example, it would be useful to know whether the data
provided can fit within a single GPU instance so that we can decide on the optimal processing pattern
for the data). In particular the size of the historical dataset that will be provided on which to train
the model.
A17. It is not up to the participants to allocate the AWS resources, they just upload the software to the platform. We allocate exactly one "P2.8xlarge" instance of AWS for each "run", and it is available for max. 6 hours. The participant can decide to use only part of this allocation. See the Annex of the Rules of Contest.
Q18. Concerning the spatialization in the starting-kit: the two mapping matrices (44x44 and 64x64)
seem to map a 1d-array into a 2d 'raster'. Are the close points in the 2d map close also in terms of
measurements? Can I assume, here or in the real competition, that two adjacent cells in the resulting
44x44 matrix are adjacent also in the reality?
A18. No, there are no guarantees of closeness among cells in the sample code. Reordering the data to
increase forecast accuracy is part of the challenge.
Q19. Your ReadMe suggests that missing values will be coded as NaNs. However, both the viewer
we are using (HD5View) and the library we are using to pull the data into our system seem to be
reading missing values as zeros. For example, in X19.h5, line 171 (timestamp 1248265500) appears
(based on the viewer and library that we are using) to be completely full of zeros (so clearly a missing
value). I wonder if you could just confirm whether that line does indeed have NANs in it (and it is our
ignorance that is causing the problem) or whether it is genuinely full of zeros.
A19. Non-stationarities can involve elements of the time series that are zero for substantial portions
of the overall record; the sample data is only a snippet.
Q20. In the starting kit phase, I can recover by myself the function index-to-unixtime. In the contest
platform, will you give contestants the unixtime and not the index?
A20. We provide unixtime time stamps; how you index them in your submission code is up to you.
Please pay attention to the README.txt file and the instructions about reading the horizon value first,
so as not to make assumptions about its value.
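As a sketch of the index convention in Python, using the 300-second interval and the start time of 1,246,492,800 seconds quoted elsewhere in this FAQ; treat both as assumptions to verify against the actual timestamps and the horizon value read from the platform:

```python
def unixtime_to_index(t, t0=1246492800, isi=300):
    """Map a unixtime stamp to a 0-based frame index, assuming a
    fixed inter-sample interval `isi` and a start time `t0` (both
    taken from this FAQ, to be checked against the real data)."""
    return (t - t0) // isi

def index_to_unixtime(i, t0=1246492800, isi=300):
    """Inverse mapping: frame index back to a unixtime stamp."""
    return t0 + i * isi
```
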
Q21. Is it possible to officially add participants to an existing team?
A21. It is not possible to add participants after the registrations have been submitted and closed.
Q22 Are we allowed to provide our own Dockerfile which includes necessary dependencies (e.g. C++
HDF5 library) ?
A22 Within the 1 MB limit, useful libraries such as interfaces can be included; check the provided
docker in interactive mode for the binaries and libraries already present.
Q23 If providing a Docker file with necessary dependency is not allowed, can we download and
compile the necessary dependencies? Our main concern is using with C++ HDF5 library for
reading/writing data, and the library itself can take more than 2MB in compressed form, while the
submission is limited to 1MB.
A23 Please read A22 above. We will look into this issue; for now, either interface to the binaries
present or use a higher-level language to read/write/translate to a form you can more easily read.
Q24 Could you clarify on the status of ‘python cuda’ and ‘python opencl’. These packages would be
extremely helpful, but they are commented out in the Dockerfile – referring to be “packages with
problems”. Any chance we can get them running?
A24 Please check the latest docker file functionality.
Q25 Are we restricted to the packages present in the Dockerfile or can we load additional ones (at
least if they are available in the main/universe repositories)?
A25 You must include additional packages/code in the submission itself as per Rules of Contest.
Q26 It is with certain disappointment that we see the target data time series being stripped of their
geolocations. Thus, it all boils down to finding the needle in a haystack, i.e. a kind of big regression
analysis. What is the idea behind withholding that much prior knowledge? It is somewhat weird to
call it a geospatio-temporal forecasting competition then…
A26 The target data is presumed to have complex topological and causal relationships that require
extensive analysis, even if the geolocations are given. In a Big Data context, anonymisation of data is
common due to privacy and confidentiality issues. To encourage the development of novel methods
within the constraints of confidentiality, the geospatial information is anonymised.
Q27 Auxiliary data contains useful spatial information but the targets (the tension lines) are
anonymised with respect to the location, which is a strong and possibly unnecessary limiting factor
for the spatio-temporal predictive model. We believe that every participant would be able to provide
a more effective model if the location of the target were made available. Is it possible for the next
release of the data to include such information?
A27 Same as A26.
Q28 Would it be possible to have more information about when the auxiliary data is available to the
model? Is it safe to assume that, except in specific cases to test for robustness, at every generation
step m all the auxiliary data until the “prediction generation time” will be available?
A28 Yes, though delays between latest auxiliary data and prediction generation time may vary.
Q29 Could you confirm that every submission evaluation starts with training (i.e. m=0) and that the
time spent during training is not counted in the OEET score? Is there an upper bound on the time the
model can train on your platform before being used for the predictions?
A29 Please reread the Rules of Contest. The ‘O’ in OEET stands for ‘overall’, both training and
prediction overall must be within the time bound given. How your submission chooses to apportion
that time is entirely up to you.
Q30 The original data contains the time of the sample in unix time (seconds since 1970), which is a
convention that allows absolute time (dates with arbitrary accuracy) to be represented as numbers.
We have reindexed this absolute time as a sequence of integers starting from 0, which corresponds
to 1,246,492,800 seconds since 1970. The interval between 2 time indices is 300 seconds, e.g, a file
containing 12 frames has time indices [0, 1, 2, ..., 11], corresponding to a duration of 11*300 = 3300
seconds. The files and contiguous sequences have a variable number of frames. Is the goal to predict
11 frames into the future? On the contest platform, will you give contestants the unixtime and NOT the
index?
A30 We provide unixtime time stamps; how you index them in your submission code is up to you.
Please pay attention to the README.txt file and the instructions about checking the horizon value first.
Q31 It seems that Tensorflow 1.5 cannot be used with CUDA 8, and that downgrading to
Tensorflow 1.4 requires cudnn 6 which is not installed. Has anyone else reported the problem?
A31 cuDNN v.5.1 for CUDA 8.0 is installed in the docker instance on the evaluation platform (and
therefore does not have to be included in the submission). This version is matched to the Tensorflow
1.0.1 version (seen in the latest Dockerfile).
Q32 We notice that the Dask library is available in your docker for python2 but not for python3. We
are currently supposing that this is a mistake and that the Dask library will also be
available for the python3 environment. Could you confirm our assumptions? Can you add a "pip3 install
dask" line in the dockerfile for the next version of the starting kit?
A32 Dask is now available for both Python 2 and 3. No version updates are foreseen at this point. See
updated version of GETTING_STARTED.txt for help on how to add cuDNN to your docker instance on
your own equipment.
Q33 In the example script, we notice that train and adapt files are processed and cached to disk
for speed improvement at the following steps, can you confirm that this is indeed allowed in the
production environment? Can you tell us what is the available space on disk reserved for caching
and if we can assume that this space is indeed conserved from one processing step to another?
A33 The space available from one step to another depends on your submission (what it does) as well
as machine memory limitations and initial cache settings. It is best to monitor such settings
(note that your code has no sudo privileges).
Q34 Can we initialize some program/thread at the first processing step and keep those programs
running for a long period without having to restart them at every processing step? One typical example
of such an always-running thread would be "in-memory caching". Is this allowed?
A34 We do not actively kill running processes initiated by your submission in between intermediary
steps. Each step begins (at least) one additional process. Whether you cache on disk (you may use the
same folder as predict.sh) or use inter-process communication, your code must monitor disk and RAM
usage (without sudo). An accumulation of active processes as steps progress is not recommended.
Q35 Looking at the sample data, we observe that some lines show a value that is equal to 0 for quite
long periods, we are wondering if those should be considered as missing data or not. The manual
states that missing values should appear in the matrices as NaN, but we haven't seen any NaN in the
sample data. So we wonder if the zeros are actually the NaNs.
A35 Missing target data (whether NaN or gaps in timestamps) is likely to appear in upcoming phases.
Zero values over long periods are part of the challenge.
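A small pure-Python sketch of how a submission might distinguish explicit NaNs from long runs of exact zeros when deciding how to treat suspect stretches of data (the helper names are illustrative):

```python
import math

def longest_zero_run(series):
    """Length of the longest consecutive run of exact zeros."""
    best = run = 0
    for v in series:
        run = run + 1 if v == 0 else 0
        best = max(best, run)
    return best

def has_nan(series):
    """True if any value is a float NaN (explicitly coded missing)."""
    return any(isinstance(v, float) and math.isnan(v) for v in series)
```
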
Q36 Can we assume that the adapt data of step i necessarily have a datetime earlier than the
data of step j with j > i? As you clearly explain that you want to emulate real-time conditions, we
believe that the assumption is correct, but we'd like to get a confirmation.
A36 Yes, for the target data it should be this way.
Q37 Can we assume that there are no missing values in the main input data? The manual clearly states
that auxiliary data could be missing, but the text is less clear for the main data.
A37 Missing target data (whether NaN or gaps in timestamps significantly greater than 5 mins) is
likely to appear in upcoming phases.
Q38 What does the data represent? Do the values in the 'X' data represent power
production/capacity/excess/transfer or something else?
A38 The data represents estimates of flow (interpretable as ‘transfer’).
Q39 As we understand it, the data consists of 3 dimensions: x/y coordinates and time. While the time
steps seem to be defined, it is not so clear about the geospatial scaling. Do different datasets
represent different locations? If so, is the space scale preserved across datasets, or is different spatial
scaling used throughout?
A39 Each set has different geospatial scaling.
Q40 In the competition description it is written: "Datasets will consist of time series recording weather
conditions and parameters of energy grid operations". Is there such additional data given, or should it
be considered?
A40 Please see the description of the auxiliary data.
Q41 What is the role of the adapt datasets? The introduction document says that there are several
phases and additional data will be provided for every phase. It is not clear what kind of data it will be
and how we should iterate through it. Can you please provide more details?
A41 Please re-read the documentation provided.
Q42 What is the copyright on the dataset, can it be shared freely, or reused for purposes other than
the competition (for example as an example in a course)?
A42 The starting kit is a compilation of material restricted to the purpose of evaluation in the context
of this competition. As some of the material cannot be restricted in such a manner (see
DISCLAIMER.txt in the Starting Kit), you may reuse such material according to its original license.
Q43 The auxiliary data shows a quite short span of GOSAT data (with 3 or 5 days it is literally
impossible to fit models, especially with slowly changing values such as atmospheric GHG column
measurements) and no OPSD 60 data. We understand that the auxiliary data in the test environment
might look completely different. But can we at least assume that measurements available during the
‘adapt’ steps are also available in the ‘train’ data set? If that were the case, we would for now
manually download further GOSAT and OPSD 60 values to test our models.
A43 You may manually download further data from any of the auxiliary data sources, of course.
While the train dataset is indeed representative of the adapt step data, please re-read Starting Kit
text files, especially about non-stationarity.
Q44 The OPSD60 data set comes with columns with forecast data, e.g. ‘DE_50hertz_wind_forecast’
or ‘DE_amprion_solar_forecast’. However, the values provided seem not to be the day-ahead values,
viz. a forecast for day X generated on day X-1 is only made available to our calculations on day X. In
real circumstances, values would already be provided a day ahead, e.g. 50 Hertz is computing their
values at 9:00 of the previous day (and even publishing them to the public at 18:00 of the previous
day). For a realistic scenario, it would be beneficial to provide these 6 time series “lagged” compared
to the others.
A44 The forecast sub-series within the OPSD60 dataset (of which there are more than 6) are
potentially useful proxies for step-ahead prediction, or rather for deciding which lead they are most
useful for ("day-ahead" or otherwise: that is up to you). The forecast accuracy varies among these
series. There is insufficient evidence, supported by the data itself and consistently applicable to each
of the "forecast" sub-series, to correct the OPSD series compilation process (which is quite complex,
covering multiple years) in the manner suggested.
Q45 Could you give us an order of magnitude estimate of the likely number of steps involved
(1,10,100, 1000 etc…) so that we can decide whether the model is likely to run within six hours and
therefore worth submitting?
A45 In the current phase you can expect approximately 600 steps total along with a horizon of 12.
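As a rough back-of-the-envelope check (assuming the full 6-hour allocation from A17 and roughly 600 steps, both quoted in this FAQ), the average wall-clock budget is 6 × 3600 / 600 = 36 seconds per step:

```python
def per_step_budget(total_hours=6, steps=600):
    """Average wall-clock seconds available per step if the whole
    allocation were spread evenly.  Training time is not set aside
    separately; see A29 on the overall OEET bound."""
    return total_hours * 3600 / steps
```
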
Q46 What is there to do about problems building the interactive notebook Docker container in the
Starting Kit using the updated evaluation Dockerfile as a base? Are there Xm1, Xm2, Xm3, Xm4
folders in the test platform as mentioned in the /notebook interactive scripts in the Starting Kit?
Where is the /adapt folder on the test platform and which data is used for pre-ranking?
A46 Read carefully the top-level README.txt in the top folder of the Starting Kit (a file whose text
begins with “Starting Kit, H2020 Big Data Horizon Prize”) that provides guidance as to questions
about formatting and data. The interactive docker and related instructions under the /notebook
subfolder are optional, for extra convenience and are provided “as-is” (see readme texts inside the
/notebook folder for disclaimers).
There are no Xm2, Xm3 and Xm4 currently foreseen. Were these sub-folders present, they would
signify training data rather than adapt data (and their presence would be consistent with the top-level
README.txt instructions). Please refer to the top-level README.txt (especially lines 68-118) for
guidance on data formatting (again, not the /notebook folder files), as well as the Rules of Contest for
questions about pre-ranking.
The top-level README.txt specifies how training data is to be found. Please pay particular attention
to lines 92 to 89. For the auxiliary data, the containing folder X* under /train/aux/<sourcea> can have
any value for *. For your convenience, a containing folder “Xm1” has multiple records, while “X0” has a
single record. At any rate, the contents of /train are all available at the execution of the initial step
and may be used. On the other hand /adapt/aux files are presented incrementally, as are the target
data files.
Q47 Is it required to use predictSpatioTemporal.py as in the starting kit? Is it required to use Python?
A47 Submissions don’t have to use Python at all (as per Rules of Contest). You must respect the syntax
of predict.sh, but that script can call anything else that the docker and your submission support. If you
choose Python as your main scripting language, you certainly don’t have to use
predictSpatioTemporal.py (that is sample code only, you may ignore it). On a related note, if you use
python3 in combination with the sample code classes, exercise great care with compatibility, paths,
class names, etc.
Q48 Will auxiliary data have the same basic data sources in the test platform as in the starting kit? Is
the auxiliary data presented with the same frequency as in the starting kit? Will missing values only
be gaps in provided data? If not, how will missing values be encoded ('NaN', 'NA', ... ?) ?
A48 The main determinant of the frequency (and time distribution) of auxiliary data is its original
availability: you may expect all sources listed so far, in both train and adapt, except opsd30 and
NOAA_HYCOM, whereas the remaining NOAA source has a relevant end-of-availability date. See top-
level README.txt for further details under heading ‘Auxiliary Data’. Whether or not any of the
auxiliary sources are ‘worth’ processing is a decision that your submissions may make as they execute
(please re-read the top-level README.txt, the paragraph beginning at line 44). In the same
README.txt, missing values are specified as NaNs (line 23); it does not say that only gaps (or all
gaps) are provided as NaNs: exercise caution.
Q49 How many files in the adapt folder do you mean by 600 steps with a horizon of 12? Where is
taskParameters.ini found during test platform runs relative to /train and /adapt? Does only the adapt
data count towards an RMSE score? How can we know how a submission’s RMSE compares to the
persistence model?
A49 A step, as defined in the Rules of Contest, requires prediction of a certain horizon. Therefore 600
steps implies 600 files. The top-level README.txt in the Starting Kit is clear on the fact that you should
read the .ini file value, and where it is found (see line 85). Your script predict.sh (at the top level of the zip, as in the
sample submission in starting kit) should respond to being called, and output a file of required
horizon. The top-level README.txt in the Starting kit (as well as Rules of Contest) explains that,
indeed, the adapt data only (and not train) forms the basis of the RMSE score. Your submission may
calculate the persistence RMSE score as it executes at each step and (possibly) react based on this
comparison.
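A minimal sketch of the persistence baseline and RMSE computation in pure Python; the frame layout is simplified here to lists of per-line values, whereas the actual data format is HDF5 as described in the Starting Kit:

```python
import math

def persistence_forecast(last_frame, horizon):
    """Persistence baseline: repeat the last observed frame for each
    of the `horizon` future steps."""
    return [list(last_frame) for _ in range(horizon)]

def rmse(pred_frames, true_frames):
    """Root-mean-square error over all frames and lines."""
    se = n = 0
    for p, t in zip(pred_frames, true_frames):
        for a, b in zip(p, t):
            se += (a - b) ** 2
            n += 1
    return math.sqrt(se / n)
```
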
Q50 What is the role of the adapt datasets? The introduction document says that there are several
phases and additional data will be provided for every phase. It is not clear what kind of data it will be
and how we should iterate through it. Can you please provide more details?
A50 The adapt data can be thought of as incrementally presented “test” data. Phases signify pre-
ranking and final validation (as in Rules of Contest). ‘Adapt’ does not refer to a separate dataset but
a portion of the data that is assigned as “test” data for each phase. Whereas “training” data is
provided at the beginning of the evaluation of a submission, “adapt” data (both target and auxiliary)
is provided in incremental steps. Read the top-level README.txt in the starting kit under the heading
“How submissions are expected to access data:”; inspecting the sample persistence submission in the
Starting Kit is also helpful, as is the FAQ.
Q51 What common errors have submissions so far exhibited in their data access?
A51 Directory and file paths: do not hard code data paths. Refer to the file predict.sh under
/notebook/sample_submission/persistence/submission_sample_persistence.zip which describes the
call syntax of the entry point to your own 'predict.sh' script. Among its arguments are top paths of
input and output data. Do not assume trailing '/' in the paths given as bash arguments. The top-level
README.txt file in the Starting-Kit describes the subdirectory structure of the input data on the
platform (it is not the same as the directory structure of the Starting Kit itself).
Here is a compact description of the sub-paths relative to the input data folder as they actually are.
************************
taskParameters.ini
/adapt/X*.h5 (* depends on step, starts with X0.h5 then X1.h5 etc.)
/adapt/aux/opsd15/X*/X*.h5 (first * depends on step, second * are different at each step)
/adapt/aux/opsd60/X*/X*.h5 (first * depends on step, second * are different at each step)
/adapt/aux/NOAA_NCOM_Region2/X*/X*.nc.tar.gz (first * depends on step, second * are different
at each step)
/adapt/aux/gosat_FTS_C0*S_2/X*/X*.h5 (the first * is 1,2 or 3, second * depends on step, third * are
different at each step)
/train/Xm1/X0.h5
/train/aux/opsd15/Xm1/X*.h5
/train/aux/opsd60/Xm1/X*.h5
/train/aux/NOAA_NCOM_Region2/X0/X*.tar.gz
/train/aux/gosat_FTS_C0*S_2/X0/X*.h5
************************
Hard-coding paths based on starting kit examples may lead to errors. Specifically, this line contains a
hard-coded path that throws an error: h5py.File(self.data_dir + '/train/Xm1/X1.h5','r')
Do not remove or alter the final line 'makeready.sh $3/Y$1'. Do not use default paths mentioned in
comments (those are suggestions for off-platform testing). Please note that you may not overwrite
output files once makeready.sh is called.
CLARIFICATION: as seen in /sample_data and /sample_submission directories we do not use .hdf5 or
.hdf as file extensions for HDF5 files but rather we use, and you should use: .h5
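A small Python sketch of defensive path handling consistent with the advice above (the helper name is illustrative):

```python
import os

def data_path(top, *parts):
    """Join the data-directory argument passed to predict.sh with
    sub-paths, tolerating an optional trailing '/' on the argument
    (do not assume one is present)."""
    return os.path.join(top.rstrip("/"), *parts)
```

For example, `data_path(input_top, "adapt", "X0.h5")` works whether the platform passes `/data` or `/data/` as the input-directory argument (`input_top` is a hypothetical variable name).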
Q52a What common errors have submissions so far exhibited in their code path management?
A52a Although a docker script has been released, inadvertent code path errors can still occur (e.g.
libraries and files you may have present on your own setup but not contained in the submission zip file
and not on the platform). You must ensure that all the needed code is contained within your zip,
apart from the libraries shown in the top-level Dockerfile in the Starting Kit and the cuDNN library (see FAQ
and Starting Kit for details; note the date of the docker build). No sample code (e.g. .py files from the Starting Kit)
is present on the platform: only makeready.sh can be assumed to be provided.
Q52b What common errors have submissions exhibited in their version compatibility management?
A52b Please check version compatibility of software. For example, when using sample code from the
Starting Kit written for Python 2, while intending to run your scripts under Python 3, you must make
sure all code that is called is checked and upgraded for Python 3 compatibility.
GPU users: particular attention needs to be paid to Tensorflow versions (see latest Starting Kit and
dockerfile), as well as flag settings for Keras and other related GPGPU software: you must make your
code compatible with the versions present.
Q53 Where is the public_data mentioned in a Readme file in the starting kit: "DATA — Directory
“sample_data” contains a mini-dataset for debug purposes. Replace sample_data by public_data, if
you want a larger dataset”?
A53 There is no other dataset provided for off-platform evaluation than the one found in the Starting
Kit under folder /sample_data. Sample auxiliary data is found under folder /auxiliaryData. See top-
level README.txt file in the Starting Kit for instructions on data directory structure on platform.
Additionally the file predict.sh under
/notebook/sample_submission/persistence/submission_sample_persistence.zip describes the call
syntax of the entry point to your code; among its arguments are top paths of input and output data.
Otherwise, regarding the "larger dataset" but also in general, the interactive docker and related
instructions under the /notebook/See4C_starting_kit subfolder are optional, for extra convenience
and are provided “as-is”. The phrase means that you can augment the sample data in the starting kit
from public sources (or emulate it) for practice purposes.
Q54 Can we expect NOAA NCOM data for both train and test at a majority of the steps on the test
platform?
A54 The NOAA NCOM source coverage duration overlaps most of the adapt and train periods of the
upcoming phases on the platform. By 'adapt period duration' we mean the time range between the
last train target data sample and last sample in the last adapt step of a phase.
Q55 Will most observations still have 5-minute gaps and will the missing data be missing for a
multiple of 5 minutes?
A55 Most Inter-Sample Intervals (ISIs) in the target data are 5 minutes, but not all ISIs can be
assumed to be multiples of 5 minutes. That said, keep in mind (citing from top-level README.txt in
Starting Kit): "the forecasts are always to be given at 5 minute increments."
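A minimal sketch of generating forecast times at the required 5-minute increments (the variable names and the assumption that t is expressed in minutes are ours, for illustration only; use whatever time encoding your input data actually carries):

```python
# Times of the HORIZON forecasts, at 5-minute increments after the last
# observed sample t_last (placeholder names; illustration only).
STEP = 5  # minutes

def forecast_times(t_last, horizon):
    """Forecast times t_last + 5, t_last + 10, ..., t_last + 5 * horizon."""
    return [t_last + STEP * (k + 1) for k in range(horizon)]

print(forecast_times(100, 3))  # -> [105, 110, 115]
```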
Q56 Can the predict.sh script included in a submission be placed at a location other than the root
folder of the submission itself?
A56 No, it cannot. As mentioned in A49, it must be located at the root of the submission folder.
Please make sure not to introduce additional folder levels in the process of compressing your
submission into a .zip file. Some operating systems may automatically add a folder structure to the
.zip file, so please double-check your .zip file before submitting/uploading.
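One way to double-check is to inspect the archive's name list before uploading; the following helper (hypothetical, not part of the contest tooling) flags the extra-folder mistake:

```python
import io
import zipfile

# Self-check that predict.sh sits at the ROOT of a submission archive,
# i.e. that no extra folder level was introduced while zipping.

def predict_sh_at_root(zip_bytes):
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return "predict.sh" in zf.namelist()

def make_zip(arcname):
    # Build a tiny in-memory archive containing one file at the given path.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(arcname, "#!/bin/bash\n")
    return buf.getvalue()

assert predict_sh_at_root(make_zip("predict.sh"))          # flat layout: OK
assert not predict_sh_at_root(make_zip("sub/predict.sh"))  # nested: would be rejected
```

With a real file, `zipfile.ZipFile("submission.zip").namelist()` (or `unzip -l submission.zip`) shows the same information.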
Q57 In some of the distributed materials predict.sh is called using step argument numbers 1 to M (see
rules and starting kit GETTING-STARTED.txt). In others (especially the sample submission code) step 0
is explicitly mentioned, and the sample adapt files also start from 0. Could you please clarify how
steps are indexed?
A57 predict.sh is given indices (of steps) as an argument. The platform loop that increments through
the steps m is 0-indexed (it starts at 0).
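Assuming the output files are numbered by the same 0-based step index (as the sample submissions in the Starting Kit suggest), the mapping from steps to output file names looks like this (M = 3 is a placeholder step count):

```python
# 0-indexed platform step loop: step m is assumed to produce output Y<m>.h5.
M = 3  # placeholder; the real number of steps depends on the phase
outputs = ["Y{}.h5".format(m) for m in range(M)]
print(outputs)  # -> ['Y0.h5', 'Y1.h5', 'Y2.h5']
```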
Q58 The starting kit contains examples of outputs formatted according to two different specifications.
Could you specify which one should be followed so as to produce a well-formed output that the
platform will be able to score?
A58 For format information, see the more comprehensive discussion in A61 below.
Any code in the starting kit that does not produce a 2D matrix representation for the X component of
the output (e.g. notebook/See4C_starting_kit/sample_code ) should NOT be followed to format a
submission output. Following this code would create an output file that the platform will reject as an
output formatting error.
Q59 How should one manage submissions that try to use the Python package management system pip,
which appears to be incompatible with certain versions of the python3 runtime?
A59 We have patched the python3 environment of the testing platform in order to fix this
incompatibility. However, if you are using pip to import libraries that are neither available on the
platform nor part of your submission, you are in violation of the Rules of Contest and your
submission will fail, because the testing platform has disabled access to the internet.
Q60 How does the platform manage the stdout and stderr streams in order to generate the 20 lines of
log that are offered as feedback?
A60 predict.sh is executed in such a way that the truncated log combines stdout and stderr and this
behaviour is not going to be changed. Attempts to circumvent the 20 line limit by printing (to either
stream) lines of abnormal length will be interpreted as fraudulent and dealt with as per the Rules of
Contest, see sections 8.5 and 8.10 in particular. Similarly for any attempt to write (to either stream)
information gleaned from the platform and not made explicitly available to all contestants as part of
the organization of the prize.
Q61 Could you provide more information on the format of input data files on the platform and the
required format of the output files?
A61 File formats of target data
Glossary:
"h5 - numeric"
any of the HDF5 data types supported for interoperability (integer as well as floating point). See
https://www.loc.gov/preservation/digital/formats/fdd/fdd000229.shtml for details.
"h5 - numeric (float)"
only the float types among the "h5 - numeric" types, e.g. 64-bit floating point, 32-bit floating
point.
"h5 - numeric (integer)"
only the integer types among these, e.g. 32-bit integer.
"matrix (h5 - numeric)"
content of an HDF5 'type' field designating a matrix of the paired data type of the 'value' field,
e.g. 'float matrix', 'int32 matrix'.
"matrix (h5 - numeric (float))"
content of an HDF5 'type' field designating a matrix of the paired data type of the 'value' field
where that data type is h5 - numeric (float), e.g. 'float matrix'.
<xyz> means not literally 'xyz' but that 'xyz' is merely a description of <xyz>.
(‘xyz’) means xyz, literally.
Input target data file formatting
The input target data have an HDF5 structure of types/values that is hierarchical and meant to be
able to represent spatio-temporal data in general. The output data file format (see below) is meant
to be a simplified, non-hierarchical format similar to the input file format but containing only the
variable of interest in the challenge task (a numerical matrix). By input target data we mean “RTE”
.h5 files, not auxiliary data (the formats of which can be determined from Starting Kit auxiliary data).
Input target data, both /train and /adapt, both on the platform and in the starting kit, has the
following structure:
X*.h5
|-- X
|   |-- type = ('scalar struct')
|   |-- value
|       |-- X
|       |   |-- type <matrix (h5 - numeric (float))>
|       |   |-- value <see "X*.h5/X/value/X/value" below>
|       |-- t
|       |   |-- type <matrix (h5 - numeric)>
|       |   |-- value <see "X*.h5/X/value/t/value" below>
|       |-- aux <not used>
|       |-- hat <not used>
|       |-- metadata <not used>
Numerical contents of interest:
X*.h5/X/value/X/value
starting kit: size <number of time steps> x <number of lines> x 1
variable type <h5 - numeric (float)>
platform: size <number of time steps> x <number of lines>
variable type <h5 - numeric (float)>
X*.h5/X/value/t/value
starting kit: size <number of time steps> x 1
variable type <h5 - numeric>
platform: size <number of time steps> x 1
variable type <h5 - numeric>
Output file formatting
Each output file "Y*.h5", where * is 0,1,2,3.. as required (see predict.sh in the sample .zip
submissions in the Starting Kit), is an HDF5 file with one mandatory field named 'X' at the top level.
X should be a strictly 2-dimensional numerical floating-point matrix, i.e. it should have the
following structure.
Y*.h5
|-- X
|   |-- type <matrix (h5 - numeric (float))>
|   |-- value <size <HORIZON> x <NLINES>, variable type <h5 - numeric (float)>>
|-- t <values are not used, but the field must be present>
The .zip sample submissions in the starting kit output a simplified HDF5 structure which is acceptable
(and recommended), without (/type,/value) pair encoding of matrices, namely:
Y*.h5
|-- X <size <HORIZON> x <NLINES>, variable type <h5 - numeric (float)>>
|-- t <size <HORIZON>, variable type <h5 - numeric>>
Note: /value and /type are written by hdf5 libraries and should not be written explicitly by your code.
Numerical contents of interest:
Y*.h5/X/value OR Y*.h5/X
The number of rows in X must be equal to "HORIZON" and the number of columns (NLINES) equal to the
number of lines present in the target data input files X*.h5. It must contain no NaN values.
All fields other than X and t will be ignored.
Y*.h5/t/value OR Y*.h5/t
While the values of t are not used (as they are already implicit from the prediction task), t must
contain either an array with exact numerical values <1, 2, 3,.. HORIZON> (recommended), as in the
.zip sample submissions in the starting kit, or a valid null value as seen in starting kit input data in
fields such as X*.h5/X/value/aux/value (caution!).
Additional notes:
1) to save space, on the platform, the h5 - numeric variable type of X*.h5/X/value/t/value is 32-bit
integer
2) upon reading input files, it is recommended to cast data contained therein to your desired working
variable type, before using these values.
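The constraints above on the Y*.h5 contents can be checked before writing the file. A minimal sketch in plain Python (in practice you would build X with numpy and write the file with h5py; HORIZON and NLINES here are placeholder values, not the real phase settings):

```python
# Pre-write sanity checks mirroring the output requirements:
# X strictly 2-D, HORIZON x NLINES, floating point, no NaNs;
# t containing the recommended exact values 1, 2, ..., HORIZON.
HORIZON, NLINES = 4, 6  # placeholders for illustration

def check_output(X, t):
    assert len(X) == HORIZON and all(len(row) == NLINES for row in X)
    assert all(isinstance(v, float) for row in X for v in row)
    assert all(v == v for row in X for v in row)  # NaN != NaN, so this rejects NaNs
    assert list(t) == list(range(1, HORIZON + 1))
    return True

X = [[0.0] * NLINES for _ in range(HORIZON)]
t = list(range(1, HORIZON + 1))
assert check_output(X, t)
```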
Q62 Can I use the library matplotlib?
A62 The library "matplotlib" cannot be used, as it has a dependency on the _tkinter library, which is
not present on the platform.
Q63 GETTING-STARTED.TXT in the starting kit does not refer to /base/Dockerfile at all, so can we just
ignore it?
A63 Yes, the base/Dockerfile can be ignored; its use would be limited to testing basic Docker build
functionality on your system.
Q64 My attempt to run my software on the platform failed. Can I have a new attempt?
A64 No. Any attempt to run your software on the platform counts, whether it ends successfully or in
an error, timeout etc. Please do not attempt to exceed the three attempts on the platform. Exceeding
the three attempts can be considered a breach of the Rules of the Contest (section 7.1) and lead to
exclusion from the contest.
Q65 Can we assume that your environment is set up with nvidia-docker as described in
GETTING-STARTED.txt, so as to have a GPU environment accessible to competitors?
A65 Please see the answers to questions A17 and A52b, and section 9 of the Rules of Contest.
Q66 Is there any possibility that multiple submissions might run at the same time, sharing the
hardware?
A66 See answer A17. The platform allocates an entire p2.8xlarge instance for up to 6 hours to each
attempt and does not share that instance with other participants while it is running.
Q67 Following a successful attempt on the platform I see a performance score displayed. Can you tell
me the RMSE of the persistence model on the platform data, and the OEET time elapsed?
A67 The performance score (RMSE) is intended for you to improve your software before submitting it
in the Participant Portal. The fact that your attempt was successful means that the total execution
time is within the 6-hour limit (as you would otherwise have encountered a timeout error). We do not
intend to release more information on the successful run. See section 7.2 of the Rules of Contest,
and footnote 9 on page 7.
Q68 If we get a timeout error, does that ensure that our code has no other errors?
A68 No; a timeout does not rule out other errors. Please check again the input paths in the
README/FAQ as well as the output file formatting instructions (the behaviour of the sample *.zip
submissions in the starting kit serves as an example). To better leverage the feedback received upon
errors, it is recommended that your code have clear, error-signalling exits at critical points in the
execution, e.g. loading of training data, training on that data, making a forecast, etc. If your code
does not exit upon execution error(s), timeouts may result.
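The recommended fail-fast structure can be sketched as follows (a minimal illustration; the wrapper and stage names are ours, not part of the contest materials):

```python
import sys

# Wrap each critical stage so that any failure exits immediately with a clear
# message; the truncated 20-line platform log then shows where execution
# stopped instead of silently running into the timeout.

def run_stage(name, fn, *args):
    try:
        return fn(*args)
    except Exception as exc:
        print("FATAL during '{}': {}".format(name, exc), file=sys.stderr)
        sys.exit(1)

# e.g.: data = run_stage("load training data", load_data, train_dir)
assert run_stage("demo", lambda: 42) == 42
```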