H2020–EINFRA–2015–1 Page1of33
Technical Note on e-Learning Services, Intermediate Version
Workpackage: 5, VRE Infrastructure and Services Design and Development
Task: 5.5, E-Learning application services
Author(s): Pedro Gonçalves (Terradue), Fabrice Brito (Terradue)
Reviewer(s): Helen Glaves (NERC), Paulo Nunes (SatCen)
Approver(s): Cristiano Slivagni (ESA)
Authorizer: Mirko Albani (ESA)
Document Identifier: EVER-EST WP5-D5.5
Dissemination Level: Public
Status: Draft to be approved by the EC
Version: 1.0
Date of Issue: 09/12/2016
Document Log
Date        Author           Changes                                          Version  Status
14/10/2016  Pedro Gonçalves  ToC                                              0.1      Draft
28/10/2016  Pedro Gonçalves  Rationale and initial architectural design       0.2      Draft
10/11/2016  Pedro Gonçalves  Scope and Use Cases                              0.3
14/11/2016  Pedro Gonçalves  Data Agency, Jupyter Notebooks, Data Cubes       0.4      Draft
05/12/2016  Pedro Gonçalves  Updates from revision notes                      0.5      Draft
12/12/2016  Pedro Gonçalves  Update after final review                        1.0      Draft to be approved by EC
Table of Contents
1 Introduction
  1.1 Purpose of the document
  1.2 Background
  1.3 Document structure
2 Earth Science e-Learning Services
  2.1 Scope
  2.2 Operational scenarios
    2.2.1 Administrator of e-learning services
    2.2.2 Developer of e-learning modules
    2.2.3 Manager of e-learning courses
    2.2.4 Participant of e-learning courses
3 Components
  3.1 Overview
  3.2 Data Agency
    3.2.1 Data Catalogue
    3.2.2 Data Gateway
    3.2.3 Data Storing
  3.3 Web notebooks
    3.3.1 Jupyter notebook web application
    3.3.2 Kernels
    3.3.3 Jupyter notebook documents
  3.4 Data cube
4 Deployment
  4.1 Data access
  4.2 Provisioning
  4.3 Persistent storage
  4.4 Scalability
  4.5 Authentication
5 e-Learning Catalogue and Portfolio
  5.1 Sentinel-1 product information and metadata
  5.2 Sentinel-1 product subset
  5.3 Sentinel-1 change detection for flood extent
  5.4 Sentinel-2 vegetation indices
List of Figures
Figure 1 – Scope of the e-Learning Service, Components and respective Use Cases
Figure 2 – e-Learning service administrator use case
Figure 3 – e-Learning service module developer use case
Figure 4 – e-Learning module manager use case
Figure 5 – e-Learning course participant use case
Figure 6 – e-Learning Service Architectural Diagram: from Server to Application
Figure 7 – Data Agency services for facilitating the data flow to applications
Figure 8 – Displaying a Notebook file in the browser
Figure 9 – Simple interactive example in Jupyter Notebooks
Figure 10 – Converting a notebook to other output formats
Figure 11 – Earth Observation Data Cubes
Figure 12 – Loading data from the data cube in Jupyter
Figure 13 – Retrieving array data from the data cube
Figure 14 – Plotting a multi-band image from a data cube in Jupyter

List of Tables
N/A
Definitions and Acronyms
Acronym   Description
AGDC      Australian Geoscience Data Cube
AJAX      Asynchronous JavaScript and XML
API       Application Programming Interface
CEOS      Committee on Earth Observation Satellites
DAG       Directed Acyclic Graph
DOI       Digital Object Identifier
EBS       Elastic Block Store
EC2       Elastic Compute Cloud
EO        Earth Observation
ES        Earth Science
ESA       European Space Agency
EVER-EST  European Virtual Environment for Research - Earth Science Themes
FitSM     Standards for free and lightweight IT Management
FTP       File Transfer Protocol
FTPS      FTP over SSL
GDAL      Geospatial Data Abstraction Library
GUI       Graphical User Interface
HDFS      Hadoop Distributed File System
HTML      Hypertext Mark-up Language
HTTP      Hypertext Transfer Protocol
HTTPS     HTTP over TLS, HTTP over SSL, and HTTP Secure
IDE       Integrated Development Environment
ICT       Information and Communication Technology
IS        Identity Server
IT        Information Technology
ITSM      IT service management
JPEG      Joint Photographic Experts Group
JSON      JavaScript Object Notation
OGC       Open Geospatial Consortium
PDF       Portable Document Format
PNG       Portable Network Graphics
PSNC      Poznań Supercomputing and Networking Center
REST      Representational State Transfer
SAR       Synthetic Aperture Radar
SLA       Service Level Agreement
SNAP      Sentinel Application Platform
SSO       Single Sign-On
SVG       Scalable Vector Graphics
S3        Simple Storage Service
URL       Uniform Resource Locator
VM        Virtual Machine
VRC       Virtual Research Community
VRE       Virtual Research Environment
XML       EXtensible Mark-up Language
YARN      Yet Another Resource Negotiator
Applicable Documents
Document ID                      Document Title
Grant_Agreement-674907-EVER-EST  EVER-EST Grant Agreement
EVER-EST DEL WP1-D1.1            Project Management and Quality Plan

Reference Documents
Document ID                      Document Title
EVER-EST DEL WP3-D3.1            VRE Detailed Definition of Use Cases
EVER-EST DEL WP5-D5.1            VRE Architecture and Interfaces Definition
FitSM                            Standards for free and lightweight IT Management, http://fitsm.itemo.org/fitsm
1 Introduction

1.1 Purpose of the document
The main purpose of this document is to describe the consolidated design and the development of the e-Learning Services according to the specific implementations and requirements outlined in D5.1. It describes the open source components selected to support the VRCs' interactive exploration of EO data and to guide them in adapting their workflows for new data sources. It aims to cover the full EO data lifecycle, from data access, data cleansing and exploration through reproducibility to information dissemination. This document is an intermediate version delivered in M14, with the final version to be delivered by M18.

1.2 Background
Earth Observation sensors are currently generating huge amounts of data that are not easily integrated into the processing chains of the EVER-EST VRCs. To improve their usage, it is necessary to train the VRCs on the potential of these data flows and demonstrate their applicability for specific use cases. The design of the e-Learning Service uses Web Notebooks as a way to develop interactive EO data applications that can use a large number of programming languages, in the form of executable documents organized in units. It covers EO data science computing techniques that will support the training and guide future data scientists in overcoming the challenges of increasing EO data volumes, and support their ability to validate, analyse, visualize, store and curate the information.

1.3 Document structure
This introductory chapter aims to provide key information to readers that do not belong to the EVER-EST technical team in order to provide the context and placement of this document in the overall WP5 activities. For a more general perspective the reading of D5.1 is recommended.
Chapter 2 will address the general scope behind the e-learning services, their relation to the EVER-EST infrastructures and targeted use cases.
Chapter 3 will address the main components of the e-Learning Services, giving special consideration to the use of the Cloud Platform Data Agency to connect to external EO data storages together with Jupyter components and Data Cubes.
Chapter 4 will address deployment approaches and scalability features for the Notebooks and data cubes, considering the use of Docker containers.
Chapter 5 will show the initial e-learning services implemented for this intermediary version.
2 Earth Science e-Learning Services

2.1 Scope
The new generation of in-situ and space Earth Observation sensors is currently generating huge amounts of data not easily integrated into processing chains outside the ground segments of space agencies and very large institutions. The use of this data for e-Learning Services is limited to some downloaded scenes and, due to the lack of computing power and storage capacity to explore these new data flows, it needs several processing steps to be carried out before the data is in a usable form. To overcome this limitation, the main requirement of the EVER-EST e-Learning Services is to allow the development and deployment of virtual laboratories in which the VRCs can explore and execute the e-learning modules. These units will contain data resources, execution code, software libraries and documentation, and will empower the communities to explore the potential of EO data in their existing and future workflows.
Figure 1 – Scope of the e-Learning Service, Components and respective Use Cases
The approach followed in EVER-EST takes advantage of the latest developments in Information and Communication Technology (ICT). It facilitates the handling of large volumes of data and service creation and, most importantly, focuses on moving the processing to where the data is, together with new data exploitation capabilities. The availability of large data holdings accessible directly from web applications provides a wider and easier access to EO data and increases software sharing and data dissemination capabilities by empowering the end users with relevant technologies. By delivering infrastructure, platform or software as a service it is possible to support and optimise the use of VRE ICT resources using load balancing and provisioning. The EVER-EST Cloud Platform (Figure 1) provides virtual machines on demand from the ICT resources available at the PSNC infrastructure. These are customised for explicit e-learning tasks and provisioned to build virtual laboratories that support users in seamlessly running the courses and their respective modules. The necessary prerequisites are bundled in the preconfigured VM with all the required software and data connectivity capabilities.
The scope of the activity described in this document is to improve the EO data access in e-learning services using two unifying technologies: Data Cubes and Web Notebooks. Data cubes are an effective way to store and access multi-dimensional arrays of values, commonly used to describe a time series of data. For interactively exploring data in a Data Cube, Web Notebooks allow the online executable presentation of research results that can be immediately reproduced, validated and possibly extended by others. By using these two technologies the objective is to develop EO e-learning services with rich exploratory data analysis functionality that take full advantage of the ever-increasing volume of EO information.
2.2 Operational scenarios
This section describes the operational platform scenarios as:
● Administrator of e-learning services
● Developer of e-learning modules
● Manager of e-learning courses
● Participant of e-learning courses
2.2.1 Administrator of e-learning services
This scenario supports an Administrator in setting up the access to the data holdings necessary for creating data cubes and in managing the allocation of resources to users.
The Service Administrator activities are focused on the data agency components, catalogue and data gateway, and on cloud management activities. The latter concerns activities like the configuration and monitoring of VMs, deploying the necessary application packages and managing all the authorization layers and access roles. The data agency components deal with the preparation of data activities to manage data requests. The catalogue component discovers the necessary data, and the data gateway component facilitates the access to the requested data and manages the different data policies and data flow optimization. Both components of the data agency will be further presented in Chapter 3.
Figure 2 – e-Learning service administrator use case
2.2.2 Developer of e-learning modules
This scenario supports a Developer in defining an e-learning module, including the data holdings selection and validation activity.
Figure 3 – e-Learning service module developer use case
The Module Developer activities are focused on the development activities and the data agency components, catalogue and data gateway. The latter will guide the developer in assessing the necessary data and defining the data requirements of the application. The development activities include requesting the data buckets, developing the actual code that will run the application and defining the validation procedures, and will be supported by the VM resources and the Cloud Controller. The developer's dashboard will enable the developer to check the status of, deploy or stop the different VM resources used to develop the Notebooks and Data Cubes applications. Both these components will be further presented in Chapter 3.
2.2.3 Manager of e-learning courses
This scenario supports a Manager in setting up an e-learning course, including the definition of the course modules and the assignment of participants. It also includes the task of assessing the participants' feedback on the course contents and value.
The Course Manager activities include the selection and allocation of the VM resources using the Cloud Controller and, through the developer's dashboard, checking the status of, deploying or stopping the different VM resources used to develop the Notebooks and Data Cubes applications. The Course Manager will also be able to collect the inputs from the Course Participants and assess the course potential and improvement paths.
Figure 4 – e-Learning module manager use case
2.2.4 Participant of e-learning courses
This scenario supports a Participant in attending an e-learning course, including the access to data holdings and the capability to interactively execute and test the code. The participant will also be able to provide feedback and suggestions.
Figure 5 – e-Learning course participant use case
The participant will be able to discover the available e-Learning Service modules and interactively execute the Web Notebooks and the required Data Cubes. The Course Participant will also be able to provide feedback and suggestions on the course contents.
3 Components
3.1 Overview
While EO datasets are becoming more available, some technical challenges still remain to efficiently store, curate and serve such datasets. Furthermore, as applications increasingly need multiple data sources with different types of dissemination and exploitation policies, users and developers need support to integrate them. Data discovery, access and integration can be achieved in multiple ways, and selecting a proper technology largely depends on the exploitation goals of data repositories and catalogues. To overcome these challenges the EVER-EST e-Learning Service relies on data management with fast indexing of dataset metadata documents, brought together by the Data Agency, to support two core technologies: Data Cubes [1] and Web Notebooks [2]. The use of these technologies provides an easy integration of EO data and the capability to offer a complete set of analysis tools for the e-Learning modules available to the final user. Their web context and the provision of services from both components allow participants to interactively execute the courses.
Figure 6 – e-Learning Service Architectural Diagram: from Server to Application
In this section both components are described and assessed together with their potential to fully support the EO data lifecycle, from data access, data cleansing and exploration through reproducibility to information dissemination.
E-learning services must provide common capabilities that allow users to perform data operations like processing/re-processing, projection, visualization or analysis. In addition, they must be able to train users for each phase of their research activities, providing, for instance, the capability to search data, or extract single parameters or combined products from remote repositories. For this reason, the e-learning module must interface with data management tools that offer easy and seamless access to all relevant repository search and data retrieval operations, allowing extraction and distribution of single parameters or combined products on demand.
To facilitate this, the Jupyter Notebooks will take advantage of EO toolboxes (e.g. GDAL [3], SNAP [4]) and access data in HDFS, Docker data buckets and Data Cubes running on top of a Cloud-based cluster, as shown in Figure 6.
[1] http://www.datacube.org.au/
[2] http://jupyter.org/
[3] GDAL is an open source translator library for raster and vector geospatial data formats - http://www.gdal.org/
[4] SNAP is the common software platform of the Sentinel Toolboxes - http://step.esa.int/main/toolboxes/snap/
3.2 Data Agency
The Data Agency is a set of components providing services facilitating data flow (discovery and access). For this purpose, it includes a catalogue to store dataset metadata and perform complex queries. The catalogue provides a search engine capable of dealing with different types of queries (geographic, temporal, textual or numeric) and a distributed OpenSearch interface with diverse metadata search capabilities together with online access points with multiple access protocols. It provides a framework to support easy discovery of EO data (remote sensing and in situ), using best practices for search services such as OpenSearch with Geo, Time and EO extensions as defined by the CEOS (Committee on Earth Observation Satellites) [5], allowing standardized and harmonized access to metadata and data of the world's satellite Earth observation data providers.
Figure 7 – Data Agency services for facilitating the data flow to applications
3.2.1 Data Catalogue
The Data Agency Catalogue is able to store and query the EO product metadata in indexes and provides an interface for searching the datasets in a catalogue via an OpenSearch interface according to a data model. For dataset ingestion, it transforms the metadata feed into indexed JSON documents. For dataset querying, it exploits the search engine to retrieve the documents in JSON and transforms them into a metadata feed. The transformation and query semantics are defined through plug-ins, enabling several metadata models and feed formats. It uses Elasticsearch [6], a search server based on Lucene [7], that provides a distributed, multitenant-capable full-text search engine with a REST interface and schema-free JSON documents.
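As an illustration of the kind of request such a catalogue answers, the sketch below builds an OpenSearch query URL that combines a free-text term with a bounding box and a time interval. The endpoint `https://catalogue.example.org/search` and the parameter names `q`, `bbox`, `start` and `stop` are assumptions made for this example; in practice the names are read from the URL template in the catalogue's OpenSearch description document, which maps them to `{searchTerms}`, `{geo:box}`, `{time:start}` and `{time:end}`.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; a real one is advertised in the catalogue's
# OpenSearch description document.
ENDPOINT = "https://catalogue.example.org/search"


def build_query(terms, bbox, start, stop, endpoint=ENDPOINT):
    """Return a query URL for a combined text, area and time search.

    bbox is (west, south, east, north) in decimal degrees, the order used
    by the geo:box parameter of the OpenSearch Geo extension.
    start and stop are ISO 8601 date-time strings.
    """
    west, south, east, north = bbox
    if south > north:
        raise ValueError("bbox south edge must not exceed north edge")
    # Parameter names below are assumed for illustration only.
    params = {
        "q": terms,
        "bbox": ",".join(str(v) for v in (west, south, east, north)),
        "start": start,
        "stop": stop,
    }
    return endpoint + "?" + urlencode(params)


url = build_query(
    "Sentinel-1",
    (-10.0, 35.0, 5.0, 45.0),
    "2016-11-01T00:00:00Z",
    "2016-11-30T23:59:59Z",
)
print(url)
```

The response to such a request would be a metadata feed whose entries carry the online access points of each matching product.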
3.2.2 Data Gateway
The Data Agency also contains a set of components, called Data Gateway, which provide services to facilitate data access. This component exposes a data pipe service that provides the best way to deliver data to the user by finding the best location according to parameters such as the processing service and location. According to the data partnership applicable, data is provided directly from the platform infrastructure (mirror) or by re-routing the user directly to the data provider facility (Figure 7).
[5] http://ceos.org/ourwork/workinggroups/wgiss/access/opensearch/
[6] Elasticsearch is a search engine that provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents - https://www.elastic.co/
[7] Apache Lucene is a free and open-source information retrieval software library - http://lucene.apache.org/
The Data Gateway behaves as a Data-as-a-Service platform used to resolve the best location and provide access to the data based on a set of rules. The rule-based mechanism manages the data partnership, access policies and data processing scenario. This approach allows an evolution of data resource targets and ensures the long-term availability of the existing data resources as well as the addition of new ones. The general approach for the evolution of the data resources is based on the development or configuration of the Data Agency and Data Gateway platform components. The development may include new metadata harvesters, new correlation functions for advanced searches (e.g. cloud coverage for optical data) or new data access functions.
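A rule-based resolution of this kind can be pictured as an ordered list of rules, where the first rule matching a request decides the access location. The sketch below is only a minimal illustration under that assumption; the `Request` fields, the rule names and the example URLs are hypothetical and do not describe the actual Data Gateway implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Request:
    dataset: str          # dataset identifier
    user_group: str       # illustrative values: "partner" or "public"
    processing_site: str  # where the data will be processed


@dataclass
class Rule:
    name: str
    matches: Callable[[Request], bool]
    location: str  # access point chosen when the rule matches


def resolve(request: Request, rules: List[Rule], default: str) -> str:
    """Return the location of the first matching rule, else the default."""
    for rule in rules:
        if rule.matches(request):
            return rule.location
    return default


# Hypothetical rules: partners processing on the platform read from the
# mirror, public users are re-routed to the data provider.
RULES = [
    Rule("mirror-for-platform-processing",
         lambda r: r.user_group == "partner" and r.processing_site == "psnc",
         "https://mirror.example.org/data"),
    Rule("redirect-to-provider",
         lambda r: r.user_group == "public",
         "https://provider.example.org/data"),
]

request = Request("S1A_IW_GRDH", "partner", "psnc")
print(resolve(request, RULES, default="https://provider.example.org/data"))
```

Keeping the rules as data rather than code paths is what lets new partnerships or access policies be added by configuration, which is the evolution path the text describes.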
3.2.3 Data Storing
To ensure that all the e-Learning Service data requirements are met, the EO data source is directly provisioned from the Data Gateway using those tools and systematically archived on the PSNC storage by implementing three methods connecting data providers:
● Remote access either by user redirection or by piping the data request download;
● Caching the data resource on PSNC storage for a defined retention time;
● Mirroring the data resource on PSNC storage for an undefined time limit.
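The three methods differ mainly in how long a fetched resource is kept. As a minimal sketch, assuming a single in-memory store and a hypothetical `fetch` callable standing in for the remote data provider, they can be expressed as one cache with a configurable retention time: zero for pure remote access, a finite value for caching, and infinity for mirroring.

```python
import time


class RetentionCache:
    """Keep fetched resources for a fixed retention time (in seconds)."""

    def __init__(self, fetch, retention_s, clock=time.monotonic):
        self._fetch = fetch              # callable: key -> resource (assumed)
        self._retention_s = retention_s  # 0 = always remote, inf = mirror
        self._clock = clock
        self._entries = {}               # key -> (stored_at, resource)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        # Serve the stored copy only while it is younger than the retention.
        if entry is not None and now - entry[0] < self._retention_s:
            return entry[1]
        resource = self._fetch(key)
        self._entries[key] = (now, resource)
        return resource


def fake_provider(key):
    # Stand-in for a download from a data provider.
    return "contents of " + key


cached = RetentionCache(fake_provider, retention_s=3600)
mirrored = RetentionCache(fake_provider, retention_s=float("inf"))
print(cached.get("scene-a"))
```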
When applied, the data mirroring covers all product types, which are fetched and cached in the infrastructure with an adjustable time window and caching policy. A Data Agent providing all the systematic storing coordinates the data access and automatic data flow. This component is in charge of monitoring data sources for new datasets by periodically harvesting the external catalogues and data sources. All new datasets are automatically ingested in the catalogue together with their location.

To read/write data concurrently from a cloud application, technologies such as an Amazon AWS Elastic Block Store (EBS) disk attached to an EC2 instance are a possible solution and can be configured directly from the Platform Cloud Controller. To simply store persistent data on the EVER-EST infrastructure, the PSNC Cloud storage uses S3 [8] and data access occurs via a client tool like s3cmd. Applications can make use of the client tool from their premises, or from a Virtual Machine instance. Nevertheless, S3 does not allow random access to files, it never adds partial objects to the storage space (a success response of an S3 operation means that the entire object was added to the S3 bucket), and it does not provide object locking. Also, if multiple write requests are received for the same object simultaneously, only the last object written will persist. The s3cmd client is a command line tool available to the users for uploading, retrieving and managing data on the PSNC cloud storage. This tool is best suited for power users who are familiar with command line programs and for batch scripts and automated backups, but all its complexity should be hidden within an e-Learning Module.

The Developer Cloud Sandboxes service on the Platform also makes use of the Hadoop Distributed File System (HDFS). Each Hadoop Sandbox comes with an HDFS partition (typically 25 GB) complementing the classic local file system (also sized to 25 GB by default). This setting is the unit processing space at simulation level, which will aggregate and scale within a cluster. The Application Workflow outputs that need to persist from one processing step to the other (from a job to another) must be published to the HDFS partition, so that the next operation in the DAG can be assigned its input by Hadoop, tapping into the stack of HDFS data to be processed until all have been consumed. Standard operations on HDFS use the 'hadoop dfs' utility, and the Developer uses the Hadoop Sandbox API that provides the 'ciop-publish' and 'ciop-copy' wrappers on top of the hadoop dfs utility, in order to handle these automated data management functions within a Hadoop workflow. When considered over a cluster of worker machines, each having an HDFS partition provisioned from the Hadoop Sandbox template, it delivers 'data locality' for a worker, where Hadoop will send the next processing unit (hence moving code to the data). With this approach, the important thing is to manage appropriately the standard output (stdout) of each Hadoop task in order to pass it correctly as input to subsequent Hadoop tasks.
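The stdout hand-over between tasks can be illustrated without Hadoop at all. The sketch below is a toy model, not the Hadoop Sandbox API: each step reads one data reference per line from its standard input and writes one reference per line to its standard output, and a small driver feeds the output of each step into the next. The `hdfs:///...` paths and the two transformations are invented for the example.

```python
import io


def run_step(transform, stdin, stdout):
    """Read one input reference per line, write one output reference per line."""
    for line in stdin:
        ref = line.strip()
        if not ref:
            continue  # skip blank lines so they are not passed downstream
        stdout.write(transform(ref) + "\n")


def chain(steps, references):
    """Feed the stdout of each step into the stdin of the next one."""
    data = "".join(ref + "\n" for ref in references)
    for transform in steps:
        out = io.StringIO()
        run_step(transform, io.StringIO(data), out)
        data = out.getvalue()
    return data.splitlines()


# Hypothetical steps: the first "publishes" a subset, the second converts it.
steps = [
    lambda ref: ref.replace("/input/", "/subset/"),
    lambda ref: ref + ".tif",
]
results = chain(steps, ["hdfs:///input/scene-a", "hdfs:///input/scene-b"])
print(results)
```

The point of the model is the discipline it imposes: anything a step prints besides result references (log messages, progress output) would be consumed as input by the next step, which is why stdout has to be managed carefully.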
[8] S3 is a set of web service interfaces developed by Amazon to store and retrieve data. It is becoming a de facto standard in Cloud systems for data access - http://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html
3.3 Web notebooks
The e-learning service delivers a web application that allows platform users to create and share documents that contain live code, equations, visualizations and explanatory text. The typical uses for such documents include data cleaning and transformation, numerical simulation, statistical modelling and machine learning.
The Jupyter Notebook is an interactive computing environment that enables users to author notebook documents that include live code, interactive widgets, plots, narrative text, equations, images and video. The Jupyter Notebook provides a complete and self-contained record of a computation that can be converted into various formats and shared with others. It combines three components:
1. The Jupyter Notebook web application: An interactive web application for writing and running code interactively and authoring notebook documents.
2. The Jupyter Kernel: Separate processes started by the notebook web application that run users' code in a given language and return output back to the notebook web application. The kernel also handles things like computations for interactive widgets, tab completion and introspection.
3. The Jupyter Notebook documents: Self-contained documents that contain a representation of all content visible in the notebook web application, including inputs and outputs of the computations, narrative text, equations, images, and rich media representations of objects. Each notebook document has its own kernel.
The Notebook web application stores the code, passes it to the kernel for execution and displays its output together with Markdown [9] notes, in an editable document. When saved, the result is sent from the browser to the notebook server by HTTP(S), which saves it on disk as a JSON file with a .ipynb extension (Figure 8). The web application, not the kernel, is responsible for saving and loading notebooks, so it is possible to edit notebooks even if the kernel for that language is not available. The kernel doesn't know anything about the notebook document itself, as it just gets cells of code to execute when the user runs them.
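Because a notebook document is plain JSON, it can be created and inspected with nothing more than a JSON library. The sketch below builds a minimal two-cell notebook in memory and reads it back. The field names follow version 4 of the notebook format as an assumption, since the text above does not state which format version is in use; the cell contents are illustrative.

```python
import json

# Minimal notebook structure, assuming version 4 of the notebook format.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# A first notebook"],
        },
        {
            "cell_type": "code",
            "execution_count": None,  # not executed yet
            "metadata": {},
            "outputs": [],            # filled in by the web application
            "source": ["a = 10\n", "print(a)"],
        },
    ],
}

# Serialising gives the text that would be stored in a .ipynb file.
text = json.dumps(notebook, indent=1)

# Reading it back needs no kernel: the document is only data.
loaded = json.loads(text)
code_cells = [c for c in loaded["cells"] if c["cell_type"] == "code"]
print("".join(code_cells[0]["source"]))
```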
Figure 8 – Displaying a Notebook file in the browser
3.3.1 Jupyter notebook web application
The Jupyter notebook web application is the GUI with IDE capabilities, offering a powerful 'scratchpad' paradigm for the creation and management of live computational documents with rich media. Users can execute blocks of code (run by a given kernel) in the browser with automatic syntax highlighting, indentation and tab completion/introspection. First and foremost, the web application is an interactive environment for writing and running code directly in an associated kernel. So for example the code below will display as shown in Figure 9:
    a = 10
    print(a)
[9] Markdown is a lightweight markup language with plain text formatting syntax designed so that it can be converted to HTML and many other formats - https://daringfireball.net/projects/markdown/
Figure 9 – Simple interactive example in Jupyter Notebooks
In a code cell it is possible to edit and write new code, with full syntax highlighting and tab completion. By default, the language associated with a code cell is Python, but depending on the kernel, other languages, such as Julia and R, can be handled interactively. When a code cell is executed, its code is sent to the kernel associated with the notebook and the results that are returned from this computation are displayed in the notebook as the cell's output. The output is not limited to text, with many other forms of output also possible.

The results of the computation are attached to the code that generated them as rich media representations, such as HTML, LaTeX [10], PNG, SVG, PDF, etc. Besides these rich media representations, users can create and use interactive JavaScript widgets, which bind interactive user interface controls and visualizations to reactive kernel-side computations. Alongside the code, users can keep notes and other text by changing the style of a Notebook cell from "Code" to "Markdown". The notes can be organized in a hierarchical structure with different levels of headings and authored as narrative text using the Markdown mark-up language.

The Notebook dashboard is the home page and its main purpose is to display the notebooks and files in the current directory. Notebooks and files can be uploaded to the current directory by dragging a notebook file onto the notebook list. The notebook list shows green "Running" text and a green notebook icon next to running notebooks. Notebooks remain running until explicitly shut down; closing the notebook's page is not sufficient. To shut down, delete, duplicate or rename a notebook, an array of controls appears at the top of the notebook list; the same operations apply to directories and files where applicable. The main features of the web application can be summarized as:
● In-browser code editing, with automatic syntax highlighting and indentation, together with tab completion and introspection;
● Code execution from the browser, with the results of computations attached to the code which generated them;
● Display of the result of computation using rich media representations, including publication-quality figures rendered by the matplotlib library that can be included inline;
● In-browser editing for rich text using the Markdown mark-up language, which can provide commentary for the code and is not limited to plain text;
● The ability to easily include mathematical notation within Markdown cells using LaTeX, rendered natively by MathJax.
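The exchange between the web application and a kernel, where a cell's code goes in and its output comes back, can be mimicked with a toy model. The class below is not Jupyter's kernel or its messaging protocol: a real kernel is a separate process reached through messages, whereas `ToyKernel` is a hypothetical in-process stand-in that only shows two properties described above, namely that state persists between cells and that output is returned attached to the cell that produced it.

```python
import contextlib
import io


class ToyKernel:
    """In-process stand-in for a kernel: one namespace shared by all cells."""

    def __init__(self):
        self.namespace = {}
        self.execution_count = 0

    def execute(self, code):
        """Run one cell and return a reply carrying the captured text output."""
        self.execution_count += 1
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, self.namespace)
            status = "ok"
        except Exception as exc:
            # Report the error as part of the reply instead of crashing.
            status = "error"
            buffer.write(f"{type(exc).__name__}: {exc}")
        return {
            "status": status,
            "execution_count": self.execution_count,
            "stdout": buffer.getvalue(),
        }


kernel = ToyKernel()
first = kernel.execute("a = 10")      # defines a, prints nothing
second = kernel.execute("print(a)")   # sees a from the previous cell
print(second["stdout"], end="")
```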
[10] LaTeX is a document preparation system for high-quality typesetting. It is most often used for medium-to-large technical or scientific documents but it can be used for almost any form of publishing - https://www.latex-project.org/about/

3.3.2 Kernels
Through Jupyter's kernel and messaging architecture, the Jupyter notebook allows code to be run in a range of different programming languages. For each notebook document that a user opens, the web application starts a kernel that runs the code for that notebook. Kernels are programming language specific processes that run independently and interact with the Jupyter applications and their user interfaces. Each kernel is capable of running code in a single programming language and there are kernels available in several languages. The "Kernel Zero" is IPython and it comes as a dependency of Jupyter. The IPython kernel can be thought of as the reference implementation, but the number of kernels supported by Jupyter is growing, with other languages available like Julia, R, Ruby, Haskell, Scala, node.js and Go [11]. The notebook provides a simple way for users to pick which of the kernels is used for a given notebook.
The notebook web server is written in Python and allows server extensions to be written as Python modules. Several popular data science Python libraries are already available, like NumPy, SciPy, Matplotlib, Pandas and Statsmodels, and other more advanced libraries such as:
● Scikit-learn contains simple and efficient tools for data mining and data analysis, and implements a wide variety of machine learning algorithms and processes for advanced analytics;
● Statsmodels allows users to explore data, estimate statistical models and perform statistical tests; an extensive list of descriptive statistics, statistical tests, plotting functions and result statistics is available for different types of data and for each estimator;
● NLTK supports the development of programs that work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing and semantic reasoning, and an active discussion forum.
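How the notebook learns which kernels exist can be illustrated with a kernel specification, the small `kernel.json` file that each installed kernel provides. The sketch below writes one for the IPython kernel into a temporary directory and reads it back; the directory layout is simplified and the `display_name` is an arbitrary choice.

```python
import json
import tempfile
from pathlib import Path

# Kernel specification for the IPython kernel. "{connection_file}" is a
# placeholder that Jupyter replaces when it launches the kernel process.
kernel_spec = {
    "argv": ["python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
    "display_name": "Python 3",
    "language": "python",
}

with tempfile.TemporaryDirectory() as tmp:
    spec_dir = Path(tmp) / "kernels" / "python3"
    spec_dir.mkdir(parents=True)
    spec_file = spec_dir / "kernel.json"
    spec_file.write_text(json.dumps(kernel_spec, indent=2))

    # A frontend would read this file to list the kernel in its menu.
    loaded = json.loads(spec_file.read_text())

print(loaded["display_name"], "->", loaded["language"])
```

On an installed system, `jupyter kernelspec list` shows the specifications that Jupyter has found.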
3.3.3 Jupyter notebook documents
As described in the previous sections, Jupyter notebook documents contain the inputs and outputs of an interactive session as well as narrative text that supports the code but is not meant for execution. Rich output generated by running code, including HTML, images, video and plots, is embedded in the notebook, which makes it a complete and self-contained record of a computation. Notebook documents are files on the local file system with a “.ipynb” extension, so users can apply classical workflows to organize them into folders or remote repositories and share them with others. The notebook document format is JSON, with binary values encoded in “base64”. This allows Jupyter notebook documents to be read and manipulated programmatically in any programming language and, as JSON is a text format, notebook documents are version-control friendly.

Jupyter notebook documents consist of a linear sequence of cells. There are four basic cell types:
● Code cells: Input and output of live code that is run in the kernel;
● Markdown cells: Narrative text;
● Heading cells: 6 levels of hierarchical organization and formatting;
● Raw cells: Output text that is included, without modification.
The Markdown cells are used to document the computational process in a literate way, alternating descriptive text with code, using rich text. The Markdown language provides a simple way to perform this text mark-up, that is, to specify which parts of the text should be emphasized (italics), set in bold, formed into lists, etc. When a Markdown cell is executed, its Markdown source is converted into the corresponding formatted rich text. Markdown allows arbitrary HTML code for formatting. Within Markdown cells, it is possible to include mathematics in a straightforward way, using standard LaTeX notation that is automatically rendered in the HTML output as equations with high-quality typography. Raw cells provide a place in which output can be written directly; they are not evaluated by the notebook.
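Because a notebook document is plain JSON, it can be built and inspected with nothing more than a JSON library. The sketch below is a minimal example assuming version 4 of the notebook format, in which headings are written inside Markdown cells (a leading `#`) rather than as a separate cell; the cell contents are invented for illustration.

```python
import json

# Minimal notebook in version 4 of the format: a list of cells plus metadata.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Example\n", "Narrative text with $e^{i\\pi} + 1 = 0$."],
        },
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["print('hello')"],
        },
    ],
}

# Serialise as it would be stored in an .ipynb file, then read it back.
text = json.dumps(notebook, indent=1)
loaded = json.loads(text)

cell_types = [cell["cell_type"] for cell in loaded["cells"]]
code_sources = ["".join(c["source"]) for c in loaded["cells"] if c["cell_type"] == "code"]
print(cell_types)
print(code_sources)
```

The same few lines work in any language with a JSON parser, which is what makes notebook documents easy to process and to keep under version control.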
11 A complete list of the supported kernels is available at https://github.com/ipython/ipython/wiki/IPython-kernels-for-other-languages
Jupyter notebook documents available from a public URL or on GitHub can be shared via the nbviewer service. This service loads the Jupyter notebook document and renders it as a static web page. The resulting web page may thus be shared with others without their needing to install the Jupyter Notebook.

The nbconvert tool in Jupyter converts notebook files to other formats, such as HTML, LaTeX or reStructuredText12. As shown in Figure 10, this conversion goes through a series of steps: pre-processors modify the notebook in memory (for example by running the code in the notebook and updating the output), an exporter converts the notebook to another file format using templates, and post-processors work on the file produced by exporting. The nbviewer website uses this tool with the HTML exporter. When given a URL, it fetches the notebook from that URL, converts it to HTML, and serves the HTML back.
Figure 10 - Converting a notebook to other output formats
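The three conversion stages can be mimicked in a few lines of plain Python. This is a toy illustration of the pre-processor, exporter and post-processor chain, not nbconvert's own API: all function names are hypothetical, the "template" is a single HTML tag per cell, and the pre-processor clears outputs instead of executing code.

```python
import html


def clear_outputs(notebook):
    """Pre-processor: return a copy of the notebook with code outputs removed."""
    cells = []
    for cell in notebook["cells"]:
        cell = dict(cell)
        if cell["cell_type"] == "code":
            cell["outputs"] = []
        cells.append(cell)
    return {**notebook, "cells": cells}


def export_html(notebook):
    """Exporter: render each cell with a very small HTML 'template'."""
    parts = []
    for cell in notebook["cells"]:
        source = html.escape("".join(cell["source"]))
        tag = "pre" if cell["cell_type"] == "code" else "p"
        parts.append(f"<{tag}>{source}</{tag}>")
    return "<html><body>" + "".join(parts) + "</body></html>"


def add_banner(document):
    """Post-processor: work on the exported document."""
    return document.replace("<body>", "<body><!-- static render -->", 1)


notebook = {
    "cells": [
        {"cell_type": "markdown", "source": ["Flood extent"]},
        {"cell_type": "code", "source": ["1 + 1"], "outputs": [{"text": "2"}]},
    ]
}

page = add_banner(export_html(clear_outputs(notebook)))
print(page)
```

With nbconvert itself, the equivalent command-line call is `jupyter nbconvert --to html notebook.ipynb`.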
3.4 Data cube
To support the EVER-EST e-Learning Services effectively it is necessary to improve the collaborative approach for storing, organising and analysing the vast quantities of satellite imagery and other Earth Observations, with new functionalities to create EO product data cubes on demand. A data cube (or datacube) is a multidimensional array of values, commonly used to describe a time series of image data. The data cube is used to represent data along some measure of interest. Even though it is called a “cube”, it can be 2-dimensional, 3-dimensional or higher dimensional. Every dimension represents a new attribute in the database, and the cells in the cube represent the measure of interest in different temporal and spectral dimensions. Data cubes include a series of structures and tools that calibrate and standardise datasets, enabling time series analysis and the rapid development of quantitative information products. By calibrating the information, data cubes make it more accessible and easier to analyse, and reduce the overall cost for pilot applications and users.

The Data Cube is a system designed to:
● Catalogue large amounts of Earth Observation data;
● Provide a Python-based API for high-performance querying and data access;
● Give scientists and other users the ability to easily perform Exploratory Data Analysis (e.g. combining multi-sensor data on the same reference grid and pixel size);
● Allow scalable continent-scale processing of the stored data;
● Track the provenance of all the contained data to allow for quality control and updates.
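The idea of a data cube as a multidimensional array can be made concrete with NumPy. The sketch below uses synthetic random values and an assumed (time, y, x) axis order; a real cube would carry calibrated measurements and coordinate labels.

```python
import numpy as np

# Synthetic cube: 4 acquisition dates, 3 rows, 5 columns (time, y, x).
rng = np.random.default_rng(seed=0)
cube = rng.random((4, 3, 5))

# Reducing along the time axis gives one image: the per-pixel temporal mean.
temporal_mean = cube.mean(axis=0)

# Fixing a pixel gives the opposite view: a time series for that location.
pixel_series = cube[:, 1, 2]

print(cube.shape, temporal_mean.shape, pixel_series.shape)
```

Adding a spectral axis would make the same array 4-dimensional without changing how it is sliced or reduced.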
12 reStructuredText is an easy-to-read, what-you-see-is-what-you-get plaintext markup syntax and parser system - http://docutils.sourceforge.net/rst.html
Figure 11 - Earth Observation Data Cubes
The EVER-EST Service Developer is provided with a VM containing the open source Australian Geoscience Data Cube (AGDC) software package. Supported by the Data Agency, this VM is able to instantiate new data cubes in the Cloud Platform according to the needs of the e-Learning Modules. This allows data cubes to be provided directly to the modules and removes from the course the complexities of data discovery and data access.

The deployed data cubes can then be directly accessed from Jupyter Notebooks with APIs to perform basic data queries and analysis. For example, Figure 12 shows how to access the data from the data cube using the load function from the datacube library. The load function takes as arguments the product to access and the spatial and temporal extents that define the exact partition of the data cube that is requested.
Figure 12 - Loading data from the data cube in Jupyter
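The shape of such a request can be sketched as follows. The helper `build_query`, the product name and the extents are hypothetical; the commented lines assume the `datacube.Datacube` class and its `load` method from the datacube library and need a configured data cube to run.

```python
def build_query(product, longitude, latitude, time_range):
    """Collect the product and the spatial/temporal extents of a request."""
    return {
        "product": product,    # name of an indexed product
        "x": longitude,        # (min, max) longitude
        "y": latitude,         # (min, max) latitude
        "time": time_range,    # (start, end) dates
    }


# Hypothetical product name and extents, for illustration only.
query = build_query(
    product="ls8_example_product",
    longitude=(149.0, 149.2),
    latitude=(-35.4, -35.2),
    time_range=("2016-01-01", "2016-03-31"),
)
print(query)

# With a configured data cube, the request itself would look like this:
#   import datacube
#   dc = datacube.Datacube(app="e-learning-example")
#   data = dc.load(**query)
```

Keeping the query in a dictionary makes it easy to reuse the same extents for several products.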
The returned data is an array object (e.g. xarray.Dataset), a container of labelled n-dimensional arrays that wrap NumPy arrays. The NumPy array (ndarray) is the main Python object for representing homogeneous multidimensional arrays and can be used directly in the notebook. With this information it is possible to investigate the data (Figure 13) and to see the variables (measurement bands) and dimensions that were returned using the data_vars dictionary.
Figure 13 - Retrieving array data from the data cube
NumPy is an extension to the Python programming language that adds support for large, multidimensional arrays and matrices, which are ideal for operating on data cubes. It also contains a large library of high-level mathematical functions to operate on these arrays. This extension tries to overcome Python's slower code execution by directly providing multidimensional arrays, together with functions and operators that operate efficiently on arrays. Used together with a Python plotting library like matplotlib, it is also possible to produce graphic representations of the data in the notebook. Matplotlib is a Python 2D plotting library which produces figures that can be used in Python scripts. It simplifies the generation of plots and histograms with just a few lines of code, while giving full control of line styles, font properties, axes properties, etc., via an object-oriented interface. Figure 14 shows how to display composite images directly in the notebook by loading the data from the data cube. The procedural interface is designed to closely resemble that of MATLAB, which makes matplotlib easy to learn for experienced MATLAB users and a viable alternative to MATLAB in EVER-EST as an e-Learning development tool for EO data processing. The advantages of the combined use of Python, NumPy and matplotlib over MATLAB include:
● Python-based, a full-featured modern object-oriented programming language suitable for large-scale software development;
● Free, open source, no license servers;
● Native SVG support.
Figure 14 - Plotting a multi-band image from a data cube in Jupyter
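Preparing a multi-band composite for display is mostly array manipulation. The sketch below uses synthetic bands in place of data loaded from a data cube; the `stretch` helper and its percentile limits are assumptions chosen for illustration.

```python
import numpy as np


def stretch(band, low=2, high=98):
    """Scale a band to [0, 1] between two percentiles (simple contrast stretch)."""
    lo, hi = np.percentile(band, [low, high])
    return np.clip((band - lo) / (hi - lo), 0.0, 1.0)


# Synthetic stand-ins for three measurement bands of one scene (rows, columns).
rng = np.random.default_rng(seed=1)
red, green, blue = (rng.random((64, 64)) * 3000 for _ in range(3))

# Stack the stretched bands along a last axis to get a (rows, columns, 3) image.
rgb = np.dstack([stretch(red), stretch(green), stretch(blue)])
print(rgb.shape, float(rgb.min()), float(rgb.max()))
```

In a notebook, passing `rgb` to matplotlib's `imshow` displays the composite inline.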
Nevertheless, in its current implementation (version 2), the AGDC software is still only intended as a working prototype and not for operational use. For the intermediate version of this document, the objective of the work is to evaluate the feasibility of this component to provide a cohesive, sustainable framework for large-scale multidimensional data management and access for EO data in the EVER-EST e-Learning Modules. A complete demonstration is intended for the final version of this document.
4 Deployment
The Jupyter notebook web applications are provisioned in a multi-tenant environment and self-contained in Docker containers. This capability is made of two components: JupyterHub, a server that gives multiple users access to Jupyter notebooks by running an independent Jupyter notebook server for each user, and the spawners, which control how JupyterHub starts the individual notebook server for each user.
4.1 Data access
The Jupyter notebook servers and data cubes are deployed in Docker containers, with the data access happening within the container itself. To save data and share it between containers, Docker provides the concept of Docker volumes. These volumes are directories (or files) that are outside of the default Union File System and exist as normal directories and files on the host file system (the Union File System is a combination of read-only layers with a read-write layer on top that is lost when the containers are dismissed).

The Jupyter notebook servers and data cubes are spawned in a Docker container that accesses data by mounting a Docker volume. The data available in the Docker volume is dictated by the data package that originated it. The definition of a data package relies on the data discovery mechanism offered by the Data Agency. By accessing OpenSearch catalogues featuring tens of data collections and providing advanced query mechanisms driven by the thematic facets of the data (e.g. interferometric search for SAR data or cloud coverage for optical data), the Data Agency builds data packages containing references to one or more catalogue entries. Once stored, the elements referenced within a given data package are fetched from the archives (local or remote) and together create a Docker volume that is mounted on the Docker container hosting the user notebook server. From the notebook and user perspective, the data contained in the Docker volume is accessed through a typical POSIX file system, thus providing high throughput.
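An OpenSearch catalogue advertises its query interface as a URL template, which a client fills in to run a search. The sketch below shows that substitution step only; the endpoint is hypothetical, the `fill_template` helper is written for this example, and the parameter names are those of the OpenSearch Geo and Time extensions rather than those of any specific Data Agency catalogue.

```python
import re
from urllib.parse import quote


def fill_template(template, values):
    """Fill an OpenSearch URL template; optional parameters ('?') may be empty."""
    def substitute(match):
        name, optional = match.group(1), match.group(2)
        if name in values:
            return quote(str(values[name]), safe=",")
        if optional:
            return ""
        raise KeyError(f"missing required parameter: {name}")
    return re.sub(r"\{([A-Za-z:]+)(\??)\}", substitute, template)


# Hypothetical endpoint; the parameter names follow the OpenSearch Geo and
# Time extensions, which a catalogue advertises in its description document.
template = (
    "https://catalogue.example.org/search?q={searchTerms}"
    "&bbox={geo:box?}&start={time:start?}&stop={time:end?}&count={count?}"
)

url = fill_template(template, {
    "searchTerms": "Sentinel-1 GRD",
    "geo:box": "12.0,41.5,13.0,42.5",
    "time:start": "2016-10-01",
    "time:end": "2016-10-31",
})
print(url)
```

The entries returned by such a query are what a data package would reference before the corresponding files are fetched into a Docker volume.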
4.2 Provisioning
Users access JupyterHub via the web browser, as they would do with the Jupyter web application, by going to the address of the JupyterHub server. Users authenticate using the defined authenticator (in our case a Single Sign-On) and trigger a new instance of a Jupyter server using the spawner. The approach followed uses DockerSpawner to deploy Docker containers that provide resources to the Jupyter server. DockerSpawner provides two types of spawners: dockerspawner.DockerSpawner, for spawning identical Docker containers for each user, and dockerspawner.SystemUserSpawner, for spawning Docker containers with an environment and home directory for each user.
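A JupyterHub configuration along these lines might look like the fragment below. It is a sketch, not the EVER-EST configuration: the image name, volume names and mount points are hypothetical, and the option names are those of recent dockerspawner releases (older releases used `container_image` instead of `image`), so they may differ in the deployed version.

```python
# jupyterhub_config.py (fragment). JupyterHub supplies get_config() when it
# loads this file, so the fragment is not meant to be run on its own.
c = get_config()  # noqa: F821

# Start each user's notebook server in its own Docker container.
c.JupyterHub.spawner_class = "dockerspawner.DockerSpawner"

# Hypothetical image name; any image that can run a single-user server.
c.DockerSpawner.image = "example/elearning-notebook:latest"

# Read-only data package volume and a writable results volume per user.
# Volume names and mount points are illustrative.
c.DockerSpawner.read_only_volumes = {"data-package-example": "/data"}
c.DockerSpawner.volumes = {"results-{username}": "/home/jovyan/results"}
```

Switching to per-user environments and home directories would mean setting `spawner_class` to `dockerspawner.SystemUserSpawner` instead.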
4.3 Persistent storage
The Jupyter notebooks contain live code that can be run repeatedly, and the outcome of these executions can be either inline results (and thus contained in the notebook) or physical results that are written to the local file system. In the first case, the inline results, persistence is guaranteed by Jupyter when the notebook is saved; in the second case, the persistence of the physical results must be addressed by other means. In a similar approach to that used for the data packages, the persistence of the physical results is achieved by creating another Docker volume that is associated with the notebook. This solution also makes it possible to share the Docker volume as an input data package with another notebook, which can be owned by another user within the platform.
4.4 Scalability
Jupyter notebooks offer interactive processing of data via a Jupyter Web Application. While vertical scaling, which is by nature limited by host capacity, extends the processing capacity of the resources made available to a user, horizontal scaling supports the execution of the code in Jupyter notebook documents against large archives of Earth Science data. This horizontal scaling of a Jupyter notebook is the process of translating the Jupyter notebook into an operational tool for large-scale and cost-effective processing against large sets of data.

The horizontal scaling is done by exploiting the Cloud framework offered by the platform, where YARN plays a central role. As explained in section 6.4 of deliverable D5.1, YARN provides the Cloud Production Center with a resource-management platform responsible for managing computing resources in clusters and using them for scheduling users' applications. In particular, the capacity of YARN to deploy Docker containers and to support several computational models was explained in detail in deliverable D5.1. A new and dedicated computational model will be adopted in YARN to offer horizontal scaling and thus support the replication of a single Jupyter notebook processing a batch of input data across several tens, hundreds or even thousands of Docker containers, each processing a subset of the input data.
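The core of this replication is splitting a batch of inputs into subsets, one per container. The sketch below shows only that partitioning step with a hypothetical `partition` helper and invented product names; it does not represent the YARN computational model, which is still to be defined.

```python
def partition(items, workers):
    """Split items into at most `workers` contiguous, near-equal subsets."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    size, extra = divmod(len(items), workers)
    subsets, start = [], 0
    for index in range(workers):
        # The first `extra` subsets take one additional item each.
        end = start + size + (1 if index < extra else 0)
        if end > start:
            subsets.append(items[start:end])
        start = end
    return subsets


products = [f"product_{n:03d}" for n in range(10)]
batches = partition(products, workers=4)
for number, batch in enumerate(batches):
    print(f"container {number}: {batch}")
```

Each subset would then be handed to one copy of the notebook, so that adding containers shortens the overall run without changing the notebook code.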
4.5 Authentication
The e-Learning services may require user authentication at the level of the Jupyter notebook servers and data cubes. JupyterHub already provides several ways for users to authenticate via the authenticators layer. This layer is a flexible environment that supports several implementations of the component delivering the mechanism for authorizing users from the EVER-EST identity provider.
5 e-Learning Catalogue and Portfolio
The following section provides a snapshot of the existing e-Learning material. The updated and full list of the modules is available on the EVER-EST GitHub repository13.
5.1 Sentinel-1 product information and metadata
This e-learning module targets a simple hands-on lesson on Sentinel-1 data and the Sentinel-1 Toolbox (S1TBX) that is part of the Sentinel Application Platform (SNAP). The S1TBX consists of a collection of processing tools, data product readers and writers, and a display and analysis application to support the large archive of data from ESA SAR missions including SENTINEL-1, ERS-1 & 2 and ENVISAT, as well as third party SAR data from ALOS PALSAR, TerraSAR-X, COSMO-SkyMed and RADARSAT-2. This module uses the snappy toolbox, which provides access to the SNAP Java API from Python. The module opens a Sentinel-1 GRD product and extracts a few metadata fields and product information.
Module level: beginner
5.2 Sentinel-1 product subset
This e-learning module also focuses on a simple hands-on lesson on Sentinel-1 data and the Sentinel-1 Toolbox, and also uses snappy to process the Sentinel-1 GRD product. The module opens a Sentinel-1 GRD product and extracts a subset product defined by a radar coordinate and extent. This eases any subsequent processing tasks (e.g. change detection for the identification of a flood extent).
Module level: beginner

13 https://github.com/ec-everest/e-learning-modules
5.3 Sentinel-1 change detection for flood extent
This e-learning module implements a complex workflow to identify the flood extent using Sentinel-1 data. This module also uses snappy to process the Sentinel-1 GRD product.

The use of SAR satellite imagery for change detection dedicated to flood extent mapping constitutes a viable solution to process images quickly, providing near real-time flooding information to relief agencies. Moreover, flood extent information can be used for damage assessment and risk management, creating scenarios showing the population, economic activities and environment potentially at risk from flooding.

The workflow contains several steps:
● Step 0: Data preparation - Subset
● Step 1: Pre-processing - Calibration
● Step 2: Pre-processing - Speckle filtering
● Step 3: Binarization
● Step 4: Post-processing - Geometric correction
Module level: advanced
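The binarization step can be illustrated independently of snappy. The sketch below is a NumPy-only approximation on synthetic data: it relies on the fact that smooth open water appears dark in SAR imagery, and the threshold of -18 dB is an assumed value rather than the one used in the module.

```python
import numpy as np


def to_decibel(sigma0):
    """Convert calibrated linear backscatter to decibels."""
    return 10.0 * np.log10(sigma0)


def binarize(sigma0_db, threshold_db):
    """Mark pixels below the threshold as water (True)."""
    return sigma0_db < threshold_db


# Synthetic calibrated backscatter (linear units): land around 0.1, with a
# darker block standing in for open water.
sigma0 = np.full((6, 6), 0.1)
sigma0[2:5, 1:4] = 0.005

# Assumed threshold; in practice it is chosen per scene, e.g. from a histogram.
water_mask = binarize(to_decibel(sigma0), threshold_db=-18.0)
print(int(water_mask.sum()), "water pixels out of", water_mask.size)
```

Comparing masks computed for a pre-event and a post-event acquisition would then highlight the newly flooded pixels.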
5.4 Sentinel-2 vegetation indices
This e-learning module implements a workflow to process a number of vegetation indices from Sentinel-2 data. This module also uses snappy and the S2TBX.

Vegetation indices are spectral transformations of two or more bands designed to enhance the contribution of vegetation properties and allow reliable spatial and temporal inter-comparisons of terrestrial photosynthetic activity and canopy structural variations.
Module level: intermediate
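As an example of such a spectral transformation, the widely used Normalised Difference Vegetation Index combines the near-infrared and red bands (for Sentinel-2 these are commonly taken as B8 and B4). The sketch below uses NumPy on synthetic reflectances and is not the snappy-based implementation of the module.

```python
import numpy as np


def ndvi(nir, red):
    """Normalised Difference Vegetation Index, (NIR - Red) / (NIR + Red)."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    total = nir + red
    # Leave pixels with no signal in either band as NaN instead of dividing by 0.
    return np.divide(nir - red, total, out=np.full_like(total, np.nan), where=total != 0)


# Synthetic reflectances: vegetation, bare soil, and a no-data pixel.
nir = np.array([0.50, 0.30, 0.0])
red = np.array([0.10, 0.25, 0.0])
result = ndvi(nir, red)
print(result)
```

Dense vegetation gives values close to 1, while bare soil and water give values near or below 0.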