Using Cloud Bursting to Count Trees and Shrubs in Sub-Saharan Africa

Michael Requa
Cycle Computing
Email: [email protected]

Garrison Vaughan
IT Coalition/NASA Goddard Space Flight Center
Email: [email protected]

John David
SSAI/NASA Goddard Space Flight Center
Email: [email protected]

Ben Cotton
Cycle Computing
Email: [email protected]

Abstract—This paper describes the NASA Center for Climate Simulation’s (NCCS) use of cloud bursting to count trees and shrubs in Sub-Saharan Africa. We outline the analytic methodology involved in counting and estimating tree size from satellite imagery. Applying these methods over the entire region of interest represents almost 75 TB of data and 150,000 CPU-hours of computation, a sizable effort. We detail how the NCCS successfully completed this project by utilizing Cycle Computing’s software to facilitate the bursting of jobs from the NCCS’s on-premise HPC cloud environment to Amazon Web Services. Two Cycle Computing frameworks were used en route to a solution that enabled large-scale computation and data transfer while providing an end-user experience with seamless access to cloud compute for research.

Keywords-bursting; HPC; climate;

I. INTRODUCTION

As the size of observational datasets grows and the strain on HPC systems increases, scientists are challenged with finding new approaches to solve their big data problems. Internal resources are expensive to build and carry many costs beyond the equipment itself: data center space, power, cooling, and maintenance. Furthermore, internal resources are always the wrong size – too large when not fully utilized and too small when a larger burst of work is queued.

Using a public cloud service provider (CSP) allows resources to be dynamically scaled to the appropriate size for the moment, and at a predictable cost, but it introduces a different set of challenges. Scaling a compute environment to a much larger size is a specialized skill, and the dynamic nature of the environment is one that many HPC administrators are unused to. Traditional HPC administrators moving to a CSP must devise new ways to scale their compute environment, because the scaling workflow they use on their on-premise systems does not directly apply to CSP systems. Scaling clusters at the NCCS relies on administrators having control of the physical compute nodes and switches. These scaling mechanisms do not directly translate to the cloud, so administrators have to find new ways to scale in that environment. Moving the data to the compute also presents significant challenges at the many-terabytes scale.

At the 2014 AWS Public Sector Summit, Intel announced they were offering to fund the use of Amazon Web Services (AWS) resources for researchers with innovative applications as part of their Head in the Clouds challenge. Tsengdar Lee, Paul Morin, and Compton Tucker submitted a proposal to use these cloud resources to count all of the trees and shrubs in Sub-Saharan Africa (see Figure 1) in order to estimate biomass and carbon uptake in the region. This was proposed because it is useful for establishing a baseline of vegetation in the area for future studies, as well as for determining how much carbon dioxide could be released into the atmosphere if the vegetation in this region were to die off. The general approach taken to count trees and shrubs in this region can also be applied to other areas of the world, giving us these estimates in many more regions of interest. Another benefit of this proposal was that the intermediate datasets generated by this effort would also be widely useful outside of the project. Intel accepted the proposal, creating a collaboration between the NCCS, Cycle Computing, Amazon Web Services, and the research team led by Paul Morin and Compton Tucker.

II. REASONING FOR CLOUD BURSTING

The NCCS has an on-premise HPC cloud environment called the Advanced Data Analytics Platform (ADAPT), where this research was initially to be conducted. ADAPT was a desirable system for this research because of its large image data holdings. At the time of this project, ADAPT comprised around three hundred hypervisors, all of which were decommissioned compute nodes from the NCCS’ primary, traditional HPC system, named Discover. As Discover received compute upgrades, a subset of the decommissioned nodes would be brought into ADAPT. Though these compute nodes were still viable, they were no longer cutting edge and fell somewhat short of modern expectations for compute nodes in core count and RAM per core. Therefore a single hypervisor could host only two to three VMs for most projects. The ADAPT system could not be dedicated entirely to this project, as it had to meet the requirements of other research teams. With the limited compute power available in ADAPT and the large amount of data needing to be analyzed, cloud bursting was an enticing solution.

Figure 1. The area of Sub-Saharan Africa analyzed in this study.

Figure 2. TKTK

Figure 3. TKTK

Figure 4. An example orthorectification correction applied during the pre-processing of images.

This project had two distinct workflows. The first workflow would prepare the raw image files for processing. There were eleven zones (UTM zones) of interest across the Sub-Saharan region, composed of images from four different commercial satellites (WorldView-2, WorldView-3, QuickBird-2, and GeoEye-1). For organizational purposes, each UTM zone is broken up into roughly 112 (16 x 7) 100 km x 100 km tiles and further subdivided into 16 (4 x 4) 25 km x 25 km sub-tiles. The initial workflow, called orthorectified mosaicking, removed the effects of image perspective and terrain on the raw images, and then stitched them together using the best pieces from all available images. This ensured that trees crossing tile boundaries were only counted once, and produced overall consistent mosaics out of the multiple input sources. This initial workflow alone was fairly resource intensive and utilized almost all of the available compute resources within ADAPT. The second workflow would then do some preprocessing of these completed mosaic datasets and carry out the counting of trees and shrubs. Since this counting workflow operated independently on each of the sub-tiles, the entire workload was embarrassingly parallel and lent itself to out-of-core optimization and experimentation in burst computing.
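To make the tiling arithmetic concrete, the short sketch below (an illustration, not part of the original workflow) computes the nominal tile and sub-tile counts implied by this layout; it assumes every tile is fully covered by usable imagery, so the real number of processable sub-tiles is lower.

    # Illustrative only: nominal tile counts implied by the layout described above.
    UTM_ZONES = 11
    TILES_PER_ZONE = 16 * 7        # roughly 112 tiles of 100 km x 100 km per zone
    SUBTILES_PER_TILE = 4 * 4      # 16 sub-tiles of 25 km x 25 km per tile

    tiles = UTM_ZONES * TILES_PER_ZONE
    subtiles = tiles * SUBTILES_PER_TILE
    print(tiles, subtiles)         # 1232 tiles, 19712 sub-tiles at most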

Without bursting, the team would have been faced with the challenge of needing to use the same batch of ADAPT resources for both the mosaicking workflow and the actual counting of trees and shrubs. The mosaicking of a single UTM zone, given the resources available in ADAPT, took up to a week. Once one UTM zone was completed, the team would then have needed to use the same resources to carry out the counting of trees and shrubs, instead of moving on to mosaicking the next UTM zone. Since the capability of bursting into AWS via Cycle Computing’s utilities was available, the team could mosaic a UTM zone in the on-premise ADAPT system, burst the workflow of counting trees and shrubs in the new mosaics out to AWS, and move on to mosaicking the next UTM zone in the ADAPT system. It was desirable to keep the mosaic processing local in ADAPT because it kept AWS compute costs down and cut the amount of data that needed to be transferred to and kept resident in AWS (incurring cost) roughly tenfold.

Figure 5. An example classification of identified trees. Red dots identify the center of mass of the canopy, yellow dots identify the center of mass of the shadow, and green dots are the endpoints of a transect aligned through the centroid of the tree.

Bursting into AWS was fairly cost effective for this workflow. The large mosaic datasets would need to be transferred into AWS for the processing, but transferring data into AWS does not incur any cost. The output of interest from the tree and shrub counting workflow was a fairly small CSV file with the location and size estimates of the trees and shrubs discovered in each sub-tile. This meant that only a small amount of data needed to be transferred out of AWS, which does incur cost. The final architecture devised using Cycle Computing’s tools removed the large input data from AWS once each sub-tile finished processing (one job per sub-tile), which also cut down on the cost of data being resident in AWS. This meant that most of our AWS charges were for the compute capabilities we were leveraging, and a smaller amount of cost was incurred by data transfer out and data at rest in AWS. By utilizing AWS resources in this way, the original estimate of 10 months of dedicated ADAPT resources was reduced to roughly 1 month of total wall clock time.

III. INITIAL ARCHITECTURE OF WORKFLOW

Cycle Computing’s CycleCloud software was used to manage the AWS environment throughout the project. CycleCloud provides an interface for automatically provisioning and configuring cloud instances for computational workflows. It dynamically scales the compute resources to meet current demand and provides tools for transferring data between internal and cloud storage.

Figure 6. Initial architecture of the cloud environment. This design used a custom transfer listener to launch jobs as file uploads to Amazon S3 completed.

The initial AWS environment was designed with a focus on moving data from the internal storage system into Amazon Simple Storage Service (S3). To analyze 11 UTM zones, 48 terabytes of image data were transferred into S3 using a 10 gigabit Direct Connect[1] connection. However, the realized speed was closer to 1.1 Gb/s. The reason for this reduced performance is not fully understood.
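As a rough back-of-the-envelope illustration (assuming sustained throughput and decimal terabytes, 1 TB = 8 x 10^12 bits), the gap between the nominal and realized speeds is the difference between an overnight transfer and a multi-day one:

    # Rough transfer-time estimate; assumes sustained throughput and decimal terabytes.
    data_bits = 48 * 8e12                  # 48 TB of image data
    for rate_gbps in (10.0, 1.1):          # nominal link speed vs. realized speed
        seconds = data_bits / (rate_gbps * 1e9)
        print(f"{rate_gbps} Gb/s -> {seconds / 3600:.0f} h ({seconds / 86400:.1f} days)")
    # roughly 11 hours at 10 Gb/s versus about 4 days at 1.1 Gb/s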

CycleCloud’s Cluster-Init automation tool performed the submission of jobs into the HTCondor[2] scheduler by:

1) Gathering the inventory of objects in S3
2) Processing the file names into a submission template
3) Submitting the resulting template into the scheduler queue
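A minimal sketch of this pattern in Python is shown below. It is an approximation of the three steps above, not the actual Cluster-Init code; the bucket name, key prefix, wrapper script, and submit-file layout are all hypothetical.

    # Sketch: inventory S3 objects, build one HTCondor submit description per
    # image, and submit the batch. Names are hypothetical placeholders.
    import subprocess
    import boto3

    s3 = boto3.client("s3")
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket="trees-input", Prefix="mosaics/"):
        keys += [obj["Key"] for obj in page.get("Contents", [])]

    with open("count_trees.sub", "w") as sub:
        sub.write("executable = count_trees_job.sh\n")
        sub.write("request_cpus = 1\n")
        sub.write("log = jobs.log\n")
        for key in keys:
            sub.write(f"arguments = {key}\noutput = {key.replace('/', '_')}.out\nqueue\n")

    subprocess.run(["condor_submit", "count_trees.sub"], check=True)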

The processing jobs operated fully independently. Each job downloaded the appropriate image file from S3, executed the image analysis, and uploaded its results to S3. Result files were downloaded with Cycle Computing’s Pogo file transfer tool.
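A per-job wrapper in this style might look like the following sketch. It is an illustrative reconstruction rather than the project’s actual job script; the bucket names, paths, and the count_trees command are placeholders.

    # Illustrative per-job wrapper: fetch one image from S3, run the analysis,
    # and push the result CSV back. Bucket, key, and analysis command are hypothetical.
    import subprocess
    import sys
    import boto3

    key = sys.argv[1]                 # sub-tile image key passed as the job argument
    local_image = "/scratch/input.tif"
    local_csv = "/scratch/trees.csv"

    s3 = boto3.client("s3")
    s3.download_file("trees-input", key, local_image)

    # Placeholder for the project's actual tree and shrub counting code.
    subprocess.run(["count_trees", local_image, local_csv], check=True)

    s3.upload_file(local_csv, "trees-output", key + ".csv")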

Jobs ran on the c3 family of instances, which use Intel Xeon E5-2680 v2 (Ivy Bridge) processors. C3 instances come in sizes ranging from two to 32 virtual CPUs, with 1.875 GB of RAM per core and 20 GB of local ephemeral storage per core (except for the smallest, c3.large, which has 16 GB of local storage per core).

In order to reduce the cost of computation, we used the Spot Market[3], where instances can be purchased below the on-demand rate but can be lost if the market price exceeds the user-specified bid price. The HTCondor scheduler automatically re-runs jobs that do not complete; however, they restart from the beginning if the job does not have built-in checkpointing.

Overall, this environment worked well for initial processing of the images. The overall costs were approximately $3,200 for the compute instances and $750 for the S3 storage. Approximately 14,000 jobs in total ran on a peak of 500 cores. However, with a median job run time of eight hours, "badput" (computation wasted by jobs that are interrupted and must restart) was an issue. Additionally, the multi-day residence of files in S3 increased the total storage costs.
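To illustrate why long-running jobs on the spot market generate badput (with assumed numbers, not measurements from this project): every preempted job repeats, on average, several core-hours of work.

    # Illustrative badput estimate with assumed preemption odds; not project data.
    median_runtime_h = 8.0        # median job run time reported above
    preemption_rate = 0.10        # assume 10% of jobs lose their spot instance
    wasted_per_preempted_h = median_runtime_h / 2   # on average, half a run is lost
    jobs = 14000

    badput_h = jobs * preemption_rate * wasted_per_preempted_h
    print(f"~{badput_h:.0f} wasted core-hours")      # ~5600 core-hours of badput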


Figure 7. Final architecture of the cloud environment. This design used CycleCloud with SubmitOnce to burst from the internal ADAPT cluster to a cluster running in Amazon EC2.

IV. FINAL ARCHITECTURE OF WORKFLOW

The lessons of the initial architecture were applied to the next iteration of the environment, depicted in Figure 7. The new architecture used Open Grid Scheduler[4] with CycleCloud’s SubmitOnce technology to make bursting from ADAPT to AWS transparent to the end users. This necessitated a change from using S3 for file storage to using a shared NFS file system.

The pilot project analyzed a single UTM zone. A total of 1,700 jobs with a median run time of three hours processed five terabytes of input. Using a peak cluster size of 250 cores, the analysis completed in 48 hours. Again using the spot market, the cost was $225 for the instances, with an additional $125 for the Elastic Block Storage (EBS)[5] volume used for shared file access, and $150 in file transfer costs between the Availability Zones used by the execute nodes. For future runs, execute nodes were confined to a single Availability Zone in order to eliminate the costs associated with inter-zone data transfer.
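A quick sanity check on these figures (assuming one core per job and treating the median run time as typical) shows that the reported work, cluster capacity, and instance cost are mutually consistent:

    # Rough consistency check of the pilot numbers; assumes one core per job.
    jobs, median_runtime_h = 1700, 3.0
    core_hours_used = jobs * median_runtime_h       # ~5,100 core-hours of analysis
    capacity = 250 * 48                             # peak cores x wall-clock hours = 12,000
    cost_per_core_hour = 225 / core_hours_used      # ~$0.044 per core-hour on spot
    print(core_hours_used, capacity, round(cost_per_core_hour, 3))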

V. FUTURE USE OF ENVIRONMENT

The final architecture used in this project provided the NCCS with a framework that can be utilized by other users of the ADAPT system. As an outcome of this work, all of the components needed to burst jobs out into the cloud are in place within ADAPT, including a small Open Grid Scheduler cluster. The steps for another ADAPT user to leverage this cloud bursting capability are as follows:

1) The user gets their job to run within the local ADAPT Grid Engine cluster, following a few filesystem layout guidelines that make their job more easily burstable into the cloud.

2) The user determines job runtime and data requirements within ADAPT in order to determine what kind of resources they need in AWS and to estimate AWS costs.

3) The ADAPT team creates a CycleCloud cluster template for the user’s job. The cluster template differs for each user’s workflow, as it defines their AWS instance types, account information, configuration management information, and resource limits (maximum cluster size), among many other things.

4) When the user is ready, they launch the head node of their AWS cluster through ADAPT’s CycleServer instance and then submit their jobs via the SubmitOnce tools within ADAPT; as illustrated after this list, this is as simple as replacing the qsub command with SubmitOnce’s csub command.

5) SubmitOnce brokers where the jobs run, either in ADAPT or in AWS. If the job lands in AWS, SubmitOnce transfers all input and output data between AWS and the ADAPT filesystems, making the location of the job transparent to the user. CycleCloud’s tools also dynamically provision and destroy execute nodes in the AWS cluster for the jobs being submitted, limited by the maximum cluster size defined in the user’s cluster template.
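For example, with a hypothetical job script named run_subtile.sh, a submission that would normally read

    qsub run_subtile.sh zone33_tile042_sub07

becomes

    csub run_subtile.sh zone33_tile042_sub07

with no other changes to the user’s workflow.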

The NCCS sees this capability as a way to provide users the option to drastically speed up their workflows at a relatively small extra cost. The NCCS can provide users the resources they need to get their jobs running as desired within ADAPT, and subsequently give them the ability to vastly grow their compute resources once they are comfortable with their job, bounded only by what they are willing to spend in AWS. This cloud bursting capability also encourages users to optimize their code as much as possible, as shaving runtime off of applications can make a large difference in what can be accomplished in AWS on a set budget.

VI. CONCLUSION

We have described how the NCCS was able to make use of public cloud resources in order to reduce the time-to-results for a project that requires large amounts of storage and computation. The flexibility that cloud services provide allowed for significant changes in the architecture based on lessons learned from the initial configuration. The newly-created environment will allow NASA researchers to perform on-demand analysis of data using whatever environment is most appropriate for the work at hand.

ACKNOWLEDGMENT

The authors would like to thank Intel’s Head in the Clouds program for funding the cloud work.

REFERENCES

[1] “AWS Direct Connect,” https://aws.amazon.com/directconnect/.

[2] “HTCondor,” http://research.cs.wisc.edu/htcondor/.

[3] “Amazon EC2 Spot instances,” https://aws.amazon.com/ec2/spot/.

[4] “Open Grid Scheduler,” http://gridscheduler.sourceforge.net/.

[5] “Amazon Elastic Block Store,” https://aws.amazon.com/ebs/.