
Discovering Job Preemptions in the Open Science Grid

Zhe Zhang
Holland Computing Center
Lincoln, Nebraska
[email protected]

Brian Bockelman
Holland Computing Center
Lincoln, Nebraska
[email protected]

Derek Weitzel
Holland Computing Center
Lincoln, Nebraska
[email protected]

David Swanson
Holland Computing Center
Lincoln, Nebraska
[email protected]

ABSTRACT
The Open Science Grid (OSG) [9] is a world-wide computing system which facilitates distributed computing for scientific research. It can distribute a computationally intensive job to geo-distributed clusters and process the job's tasks in parallel. For compute clusters on the OSG, physical resources may be shared between OSG and the cluster's local user-submitted jobs, with local jobs preempting OSG-based ones. As a result, job preemptions occur frequently in OSG, sometimes significantly delaying job completion time.

We have collected job data from OSG over a period of more than 80 days. We present an analysis of the data, characterizing the preemption patterns and different types of jobs. Based on observations, we group OSG jobs into five categories and analyze the runtime statistics for each category. We further choose different statistical distributions to estimate the probability density function of job runtime for each class.

CCS CONCEPTS
• Computer systems organization → Reliability; Availability;

KEYWORDS
OSG, Pilot job, Job failure, Preemption, Failure pattern, Temporal locality, Spatial locality, Job runtime, Distribution, Probability Density Function, Estimation

ACM Reference Format:
Zhe Zhang, Brian Bockelman, Derek Weitzel, and David Swanson. 2018. Discovering Job Preemptions in the Open Science Grid. In PEARC '18: Practice and Experience in Advanced Research Computing, July 22–26, 2018, Pittsburgh, PA, USA. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3219104.3229282

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
PEARC '18, July 22–26, 2018, Pittsburgh, PA, USA
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-6446-1/18/07. . . $15.00
https://doi.org/10.1145/3219104.3229282

1 INTRODUCTION
The OSG is a national cyber-infrastructure consisting of computational resources distributed among tens of independently run grid sites. The users of the OSG are organized into large "virtual organizations" (VOs), each of which utilizes a specific middleware stack to achieve its goals. For smaller groups or individual scientists, the OSG provides a hosted job submission system. To coordinate these user jobs with computational resources, OSG uses HTCondor [14] to perform scheduling and job execution for millions of jobs on the grid resources. A higher-level resource provisioning system, GlideinWMS [11], allocates resources to HTCondor using the pilot paradigm. GlideinWMS submits batch jobs to the individual grid-enabled clusters; when scheduled, these pilot jobs launch an HTCondor worker node process that connects to the VO's central pool. The resulting pool overlays the disparate resources from across the grid, presenting a relatively homogeneous HTCondor batch system to the individual scientist. The individual submits the scientific payload jobs into the central VO pool. The pilot system is an effective way to manage grid resources, as users aren't exposed to the vagaries of different cluster queuing policies and VOs have fine-grained control over the resources each individual receives.

A pilot job running inside a compute cluster is fairly analogous to a virtual machine running on Amazon's EC2; for example, it must detect and manage a physical resource for a dedicated payload job. Similarly, like a virtual machine, one must be prepared for the pilot to fail without warning, with network cuts or hardware failures being possible sources. However, one failure mode common in the OSG for opportunistic jobs, but relatively uncommon in cloud providers, is batch job preemption. If, for example, a university computing center configures its grid resources to accept jobs both from OSG (pilot jobs) and local researchers, the university might assign higher priority to the local users and enable preemption of the grid jobs. As a result, OSG pilot jobs are likely to be preempted if there is a spike of local usage.

Unlike Amazon EC2's "spot pricing," there is little indication to the user when a preemption has occurred. If any feedback is provided, it is to the provisioning layer (glideinWMS), not the user scheduling layer. Given that preemptions are often immediate and without warning, they are often indistinguishable from a network connectivity drop.

At low rates, preemption is relatively harmless across the thousands or millions of jobs a user may run. At higher rates, preemptions may carry severe impacts for the payloads running inside the pilot. For example, once a disconnect or preemption is detected, the OSG payload jobs have to be rescheduled and started over (checkpointing is relatively rare). This can significantly delay job completion time; if the same payload is preempted many times, it can even prevent forward progress in a workflow.

Job failures have been actively researched over the last decades. Researchers have mainly approached the problem from two perspectives: the user-job perspective and the system-hardware perspective. For example, some researchers collected failure event logs from systems [6] and predicted failures based on different components such as memory failures, IO failures and network failures. Other interesting research [4] collected user-submitted job logs and estimated failures from jobs' return codes and other status information. Although previous research provides a rich set of methodologies for discovering failures in grids, the pilot-job based infrastructure makes OSG job failures distinct. Unlike a "white-box" experiment of collecting system logs, OSG does not have a centralized logging scheme to collect hardware failures from all grids. Unlike a "black-box" experiment of analyzing user-job logs, one does not have the ability to access full user logs from the disparate submit points. Further complicating the picture, preemption is typically controlled by site scheduling policy: this is often inscrutable from the outside, changes over time, and is often impossible to model analytically across the dozens of sites.

To build a picture of pilot-job failures, we continuously poll the HTCondor pool composed of the pilot jobs and build an in-memory model of the system over time. This lets us conduct our experiment as a "grey-box," where we collect virtual system logs to analyze failures.

In addition to predicting job failures in distributed grids, estimating job runtime is another extensive area of research over the last decades. Researchers have found that correctly estimating job runtime can help the scheduler make smart decisions and reduce the average job completion time [15]. Some research [12] has also estimated job runtime for OSG pilot jobs. However, to the best of our knowledge, none of the prior work has considered preemptions across a set of sites. The uncertainty of occurrence makes preempted jobs' runtime unpredictable, causing prior models not to fit our case. On the other hand, preemptions themselves have their own patterns. For example, a university compute resource that uses desktop cycle scheduling should see a spike of preemptions at the beginning of the workday. We work to characterize the preemptions' behaviors and design a dedicated model to better predict payload runtime.

In this paper, we investigate preemptions in OSG using jobs collected over an 80-day period. We characterize common patterns of preemptions across OSG clusters and also enumerate a few unique signatures. We also try several statistical distributions to estimate the probability density function of job runtime on OSG. This work makes the following contributions:

(1) We design a model to interpret OSG job snapshots as job life cycles and demonstrate interesting preemption behaviors.

(2) We characterize runtime for different job classes and use statistical distributions to estimate the probability density function of job runtime.

2 BACKGROUND
2.1 Pilot Systems
Pilot-based systems are widely adopted by grid infrastructures used for scientific research [8][2][1]. A reason for their success is that they provide a straightforward approach to managing the job-to-site matchmaking model ("In which site should I queue my payload job?") historically used in grid computing, which required impossibly accurate modeling of site queues and reliability. Pilot-based systems improve on the prior model in two ways: first, by using late binding, payload jobs are not assigned to sites until resources are available; second, by decoupling the workload specification from the job execution. Late binding solves the issues associated with committing a payload to a certain site before knowing how long it will take until resources are available. Decoupling workload specification from job execution helps payload jobs select resources with optimal cost through matchmaking algorithms.

Figure 1 shows a common architecture of a pilot system. To execute a task, the Pilot Manager first queries the resource pool and the workload manager to compare available resources and job demand. As necessary, it may allocate resources by creating a pilot job. The Pilot Manager can also submit these pilots to a remote resource as a job task (a pilot job is itself a job task). When started, the pilot executes daemons on batch worker nodes and is responsible for managing and advertising the resources available for application-related tasks (the payload jobs) to run on. After the resources have been successfully obtained and joined to the resource pool, the Workload Manager dispatches the payload job to the pilot, where the pilot's task manager executes the payload job.

2.2 GlideinWMS and HTCondor
For the OSG pilot system studied, the pilot manager is GlideinWMS [11] and the workload manager is HTCondor. In the OSG, this combination is used to connect over 120 individual computing resources across North America and delivers more than 70 million CPU hours a month for scientific research [5]. HTCondor is a well-known workload manager (implying a large user base with good documentation), originally arising from desktop scavenging (implying an emphasis on reliability in an unreliable resource environment). GlideinWMS provides a mechanism for building a pilot system based on HTCondor – in addition to using HTCondor throughout its internals.

One advantage of this setup is that the standard HTCondor daemons handle the scheduling of user jobs to available resources. For this paper, three pertinent roles inside HTCondor are:

• The schedd daemon manages a job queue where users can submit jobs.

• The startd daemon manages a computing resource and executes any corresponding user jobs. Periodically, it advertises its status to the central manager.

• On the central manager, the collector daemon collects the advertised state from all daemons in the pool. The negotiator queries the collector for resources and the schedds for jobs, and performs matchmaking.


Figure 1: High-level abstraction of Pilot Systems

In addition to traditional resource attributes (available CPUs, memory), on the OSG the startd will advertise local information about the grid site, such as its pilot job ID or site name.

When a payload job is matched to a resource (pilot job) by the central manager, the schedd will receive a connection string and a capability string representing a claim for running on the resource. The schedd will contact the startd directly and "activate" the claim to start job execution. The startd will advertise its updated state to the central manager and keep a TCP connection open between the pilot and the schedd to act as a job heartbeat. Thus, one can build an approximation of the system's state by solely querying the central manager.

The glideinWMS pilot has three phases in its life:

• Normal running: During this phase, the pilot behaves like a standard HTCondor worker node, except that it will shut down if it has been idle (no running payload jobs) for more than 20 minutes.

• Retirement: During this phase, the pilot will accept no new payload jobs and will shut down immediately once there are no running payloads.

• Shutdown: At this point, the pilot will shut down immediately, preempting the payload (causing the remote schedd to reschedule it).

The transition from Normal running to Retirement to Shutdown is time-based and determined at pilot job startup.

Some site batch systems configure the pilot to be re-queued after preemption, resulting in multiple instantiations of the same submitted pilot job. Thus, in the next section, it will be important to distinguish between the unique pilot job in a site batch system and the one-or-more instantiations of that job.

3 JOB OVERVIEW ON OSG
In the OSG, the collector is a central resource manager that is responsible for negotiating user jobs with available resources. Each pilot job run by the OSG reports periodically (approximately once every 5 minutes) to the collector. The report includes pilot job information such as the state, activity, run time and planned stop time; this information can be queried by any HTCondor client.

3.1 Data Collection
We sample the status of all pilot jobs in the OSG central manager once every minute; for each known pilot, the collector returns a ClassAd [10], a set of key-expression pairs describing the pilot's state (similar to a JSON map). For each pilot, we collect the following attributes from the ad:

• Name: Pilot name. Randomly generated when the pilot starts.

• State: Current HTCondor startd state; typically "claimed," "unclaimed," or "retiring."

• Activity: Current activity of the startd; can be thought of as a sub-state.

• MyCurrentTime: Current Unix timestamp when the collector sent the ad back to the client.

• TotalJobRunTime: Time, in seconds, since the currently running payload job started.

• DaemonStartTime: Unix timestamp when the pilot job was started by the site batch system.

• GLIDEIN_ToRetire: Unix timestamp when the pilot job will start retirement mode (will accept no new payloads).

• GLIDEIN_ToDie: Unix timestamp when the pilot will shut down, regardless of whether a payload is running.

• GLIDEIN_Site: Site name where the pilot is running.

• GLIDEIN_Entry_Name: A unique name (the "entry point") describing the combination of the pilot configuration and the cluster where the pilot was submitted. For example, two pilots at the same site but in different batch system queues would have distinct values for GLIDEIN_Entry_Name.

• GLIDEIN_ResourceName: The human-readable name of the cluster where the job was submitted (there are multiple resources per site).

• GLIDEIN_SITEWMS_JobId: The unique pilot job ID assigned when it is submitted to the site.

Note that GLIDEIN_SITEWMS_JobId is generated by the site batch system when the pilot is submitted and is invariant even if the site preempts and restarts the job; however, Name and DaemonStartTime change each time the pilot is started.
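To make the collection step concrete, the sketch below polls the central manager with the HTCondor Python bindings and records the attributes listed above once per minute. The pool host name, output format, and overall structure are illustrative assumptions; the paper does not describe the authors' actual tooling.

# Sketch: poll the OSG central manager once per minute and record pilot ClassAds.
# Assumes the "htcondor" Python bindings and read access to the VO pool's
# collector; the collector host name below is a placeholder.
import json
import time

import htcondor

ATTRS = ["Name", "State", "Activity", "MyCurrentTime", "TotalJobRunTime",
         "DaemonStartTime", "GLIDEIN_ToRetire", "GLIDEIN_ToDie",
         "GLIDEIN_Site", "GLIDEIN_Entry_Name", "GLIDEIN_ResourceName",
         "GLIDEIN_SiteWMS_JobId"]

collector = htcondor.Collector("collector.example.org")  # placeholder pool

def take_snapshot():
    """Return one snapshot: a list of dicts, one per pilot (startd) ad."""
    ads = collector.query(htcondor.AdTypes.Startd, projection=ATTRS)
    snapshot = []
    for ad in ads:
        row = {}
        for attr in ATTRS:
            value = ad.eval(attr) if attr in ad else None
            # Keep only JSON-friendly primitives in the stored snapshot.
            row[attr] = value if isinstance(value, (int, float, str, bool)) else None
        snapshot.append(row)
    return snapshot

if __name__ == "__main__":
    with open("snapshots.jsonl", "a") as out:
        while True:
            out.write(json.dumps(take_snapshot()) + "\n")
            out.flush()
            time.sleep(60)  # one-minute sampling interval, as in the paper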

3.2 Data Preprocessing
Starting with the collected snapshots, we aim to produce a timeline of each pilot job. Each pilot job is identified by a unique tuple, (GLIDEIN_SITEWMS_JobId, GLIDEIN_Entry_Name), which persists across multiple preemptions of the same pilot; the pool snapshots are grouped by this unique identifier. Recorded ClassAds associated with the same pilot are sorted in chronological order. We then compare each snapshot with its previous snapshot, building a timeline of state transitions. For example, if a pilot job disappears from one snapshot to the next, we consider its execution to have potentially terminated.

Each time we observe a potential termination, we would like to classify the termination reason based on the prior pilot snapshot and any potential future ones (e.g., the pilot was restarted or recovered from a network outage).

Figure 2 shows pilot jobs along a time series. There are 8 snapshots. At the first snapshot, Job0, Job1, Job2, Job3 and Job4 are in the OSG central manager and considered running. At Snapshot2, Job5 starts running, and at Snapshot3, Job6 starts running. If we group jobs in a single snapshot (vertically), we get the number of running pilots. On the other hand, if we group snapshots for a single job (horizontally), we can visualize a timeline of the pilot.

Figure 2: Determining Pilot Timelines based on Collector Snapshots

Based on the reconstructed pilot timelines, we can infer additional attributes about the pilot. For example, we can estimate the runtime of a single instantiation of a pilot by subtracting the pilot's DaemonStartTime from the MyCurrentTime of the last-seen snapshot. Each instance can be detected by the Name attribute, which changes at instance start. The estimated aggregate runtime for a single pilot is then the sum of all estimated pilot instance runtimes.
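A minimal sketch of this timeline reconstruction, assuming the snapshots have already been loaded as a flat list of attribute dictionaries, could look like the following (the function and variable names are ours, not the paper's):

# Sketch: rebuild pilot timelines from the one-minute snapshots and estimate
# per-instance and aggregate runtimes, following the rules described above.
from collections import defaultdict

def pilot_runtimes(snapshot_records):
    """snapshot_records: iterable of dicts, one per pilot ad per snapshot."""
    # Group all recorded ads belonging to the same unique pilot job.
    pilots = defaultdict(list)
    for ad in snapshot_records:
        key = (ad["GLIDEIN_SiteWMS_JobId"], ad["GLIDEIN_Entry_Name"])
        pilots[key].append(ad)

    runtimes = {}
    for key, ads in pilots.items():
        ads.sort(key=lambda a: a["MyCurrentTime"])  # chronological order
        # A new Name (and DaemonStartTime) marks a new instantiation of the pilot.
        instances = defaultdict(list)
        for ad in ads:
            instances[ad["Name"]].append(ad)
        # Per-instance runtime: last-seen MyCurrentTime minus DaemonStartTime.
        per_instance = [
            inst[-1]["MyCurrentTime"] - inst[-1]["DaemonStartTime"]
            for inst in instances.values()
        ]
        runtimes[key] = {"instances": per_instance,
                         "aggregate": sum(per_instance)}
    return runtimes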

3.3 Job Labeling
Based on observed and understood pilot job behaviors in OSG, we attempt to classify each potential termination event into one of five categories (a labeling sketch follows the list):

• Retire - the pilot was in its Retirement phase and the last payload exited.

• Kill - the pilot hit its Shutdown phase, causing it to immediately preempt any payloads and terminate.

• IdleShutDown - the pilot is in the Normal phase but shuts itself down due to an idle timeout (no payload jobs were received within a configured time window).

• Preemption - the pilot was killed by the site batch system. Preempted pilots may or may not be restarted in the future by the site batch system, depending on the site configuration, preemption reason, and pilot age.

• NetworkIssue - the potential termination event was actually a network disconnect; in this case, the same pilot instance will appear in a snapshot in the future.

Note that the same pilot's unique identifier can re-appear multiple times in the case of Preemptions (multiple pilot instances) and NetworkIssues (same pilot instance). In our dataset, we recorded 1,962,608 distinct pilot jobs and 2,314,666 pilot job instances. Figure 3 shows the distribution of the different types of instances. As seen in the figure, preemptions occur frequently and contribute 30.37% of total job instances.

Figure 3: Job Distribution on OSG

48.64% are Retire jobs; these jobs indicate that OSG resources maintain a healthy state. 10.63% are IdleShutDown jobs, which indicate resources being recycled due to decreasing workloads. Only 1.34% of jobs encounter network issues; these jobs temporarily disappear in some snapshots but reappear later. NetworkIssues differ from Preemptions because they keep the same DaemonStartTime when they reappear in snapshots.

3.4 Pilot Classification over Time
Figure 4 shows the occurrences of the different job types over our observed time period. As seen in the figure, Preemptions and NetworkIssues occur in bursts. The overall number of jobs sharply decreases when Preemptions or NetworkIssues happen. In addition, at some time points Preemptions and NetworkIssues happen at the same time.

3.5 Pilot Distribution across Different Clusters
Figure 5 shows the job distribution across different clusters. As seen in the figure, there is a large variance across clusters. One reason is the variation of configurations on different clusters. For example, some clusters allow preemptions and some do not; different clusters might configure their GLIDEIN_ToRetire and GLIDEIN_ToDie to different values, and thus jobs are terminated after different durations.

Not all cluster batch systems are configured to actively preempt jobs; we will examine more closely the top sources of preemption. However, we expect all clusters to have some minimal rate of preemption, as other causes of pilot job failure (such as hardware failure or a payload consuming too much RAM) will be indistinguishable from preemption and counted as preemption in our model.

Figure 4: Different Pilot Potential Termination Events over Observation Period

Figure 5: Pilot-job Distribution across Different Clusters

3.6 Job Duration
Figure 6 shows the cumulative distribution function (CDF) of the different classes. Preemptions can significantly decrease the pilot job runtime. As seen in Figure 6, 80% of IdleShutDown pilots run for less than 1 hour. Around 40% of Preemptions also happen within 1 hour. Retire pilots appear to have several steps in the CDF. We believe those steep increases in the CDF are the result of the pre-configured GLIDEIN_ToRetire for pilot jobs. Kill pilots have a similar step-like pattern in the CDF; however, compared to Retire jobs, Kill pilots tend to have even sharper rising edges. The reason is that the pre-configured GLIDEIN_ToDie is stricter for pilot jobs than GLIDEIN_ToRetire. Once a pilot job reaches GLIDEIN_ToRetire, it can continue working on its existing payload job until the work is finished or the pilot job reaches the pre-configured GLIDEIN_ToDie. If a pilot job goes beyond GLIDEIN_ToDie, it is immediately killed. This strict time threshold results in the sharply increasing steps in the CDF. NetworkIssues, as illustrated in the figure, have varying pilot job lifetimes. As we pointed out before, NetworkIssues do not terminate pilot jobs, and those payload jobs may continue to run to completion.
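The curves in Figure 6 are empirical CDFs of the per-class runtimes estimated in Section 3.2. A small sketch of how such a curve can be produced, with made-up runtimes standing in for the real per-class data, is:

# Sketch: empirical CDF of runtimes for each job class, runtimes in seconds.
import numpy as np
import matplotlib.pyplot as plt

def plot_ecdf(runtimes_seconds, label):
    """Plot the empirical CDF of one class's runtimes, with hours on the x-axis."""
    x = np.sort(np.asarray(runtimes_seconds, dtype=float)) / 3600.0
    y = np.arange(1, len(x) + 1) / len(x)   # cumulative fraction of instances
    plt.step(x, y, where="post", label=label)

# Illustrative data only; in practice these come from the timelines of Sec. 3.2.
runtimes_by_label = {
    "Retire": [54000, 55000, 56000, 82000],
    "Preemption": [1200, 2500, 3300, 40000],
}
for label, runtimes in runtimes_by_label.items():
    plot_ecdf(runtimes, label)
plt.xlabel("Pilot runtime (hours)")
plt.ylabel("Cumulative fraction")
plt.legend()
plt.show()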

Figure 6: Job Duration Distribution of Different Jobs

4 OBSERVATIONS FOR PREEMPTIONS
4.1 Preempted jobs are likely to be preempted again
As we observed over 80 days, pilot preemptions are not rare on the OSG: 26.77% of pilot jobs encountered at least one preemption. In addition, preemptions can occur multiple times on the same pilot; in one extreme case, we observed a pilot that had been preempted 123 times. Figure 7 shows the continuous behavior of preemptions. The left plot in Figure 7 shows the ratios of continuously preempted jobs to the total number of jobs in the OSG: 26.77% of the pilots have been preempted at least once, 4.48% of the pilots have been preempted at least twice, and so on. As the same jobs continuously get preempted, they become more likely to be preempted again. The right table in Figure 7 shows the percentages of continuous preemptions that occur within the preceding preempted jobs. As seen in the table, once the pilots get preempted for the first time, 16.74% of them get preempted for a second time. Among the pilots that have been preempted twice, 39.13% get preempted for a third time. Among those pilots, 63.59% get preempted for a fourth time, and so on.
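The conditional percentages in the right table follow from the cumulative fractions in the left plot. For example, for the second preemption,

\[
P(\text{preempted} \ge 2 \mid \text{preempted} \ge 1)
  = \frac{P(\text{preempted} \ge 2)}{P(\text{preempted} \ge 1)}
  = \frac{4.48\%}{26.77\%} \approx 16.7\%,
\]

which is consistent with the 16.74% reported in the table.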

Figure 7: Distribution of Preemption Times

4.2 Preemptions are correlated with specific grid resources
Each cluster has its own local preemption configuration and local usage patterns. So far, we have measured the OSG as a black box and manually created a classification based on our experience. Within the 56 clusters we observed, we group them into three categories:

• Clusters that preempt, but re-queue preempted jobs for another execution.

• Clusters that "destructively" preempt; preempted jobs are not re-run.

• Clusters that do not appear to preempt; if preemptions occur, they are either misclassified or the cluster lacks enough OSG pilots for us to identify the preemptions.

Figure 8 shows preemption rates for all cluster resources where any preemptions were detected. To de-identify the clusters, we replace the human-readable names with an enumeration prefixed with C. Note that the preemption distribution is different from the pilot distribution shown in Figure 5.

4.3 Preemptions Occur Early in Job Runtime
Figure 9 depicts the CDF of preemptions on cluster C39. Here, preemptions occur in bursts: over 80% of preemptions happen within 1 hour of the pilot instance start.

5 PROBABILITY DENSITY FUNCTION
In this section, we present the probability distributions of the different job classes. We try to find common characteristics of a specific job class across different clusters. Our purpose is to estimate the probability density function of job runtime for a cluster. We run the following tests on every cluster but only present results for C10 (which carries the most jobs in OSG) due to the space limit of the paper. We also give a general summary of the other clusters at the end.

Figure 8: Preemptions on Different Grid Resources (Top 10)

Figure 9: Time-to-preemption CDF on C39

To estimate the runtime distribution, we use the Normal distribution, Uniform distribution, Gamma distribution, Chi-squared distribution, Johnson SU (Unbounded) distribution [7], Johnson SB (Bounded) distribution, Inverted Weibull distribution and Exponentiated Weibull distribution [3]. The Normal, Uniform, Gamma and Chi-squared distributions are well known. The Johnson SU and SB distributions belong to Johnson's distribution system, developed by Johnson in 1949. It is a flexible system of distributions, based on three families of transformations, that translate an observed, non-normal variate to one conforming to the standard normal distribution. The exponential, logistic, and hyperbolic sine transformations are used to generate log-normal (SL), unbounded (SU), and bounded (SB) distributions, respectively. The Inverted and Exponentiated Weibull distributions belong to the Weibull family, a popular statistical distribution used to estimate important life characteristics of a product such as reliability or probability of failure at a specific time, the mean life, and the failure rate. We use these 8 distributions and try to fit the runtime distributions of the different job classes.
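The paper does not state which fitting tool or goodness-of-fit criterion was used. As one plausible approach, the sketch below fits each candidate distribution with SciPy's maximum-likelihood fit() and ranks the candidates by the squared error between the fitted PDF and a normalized runtime histogram:

# Sketch: fit the eight candidate distributions to one class's runtimes and
# rank them by sum of squared errors (SSE) against a density histogram.
import numpy as np
import scipy.stats as st

CANDIDATES = [st.norm, st.uniform, st.gamma, st.chi2,
              st.johnsonsu, st.johnsonsb, st.invweibull, st.exponweib]

def best_fit(runtimes, bins=100):
    data = np.asarray(runtimes, dtype=float)
    hist, edges = np.histogram(data, bins=bins, density=True)
    centers = (edges[:-1] + edges[1:]) / 2.0
    results = []
    for dist in CANDIDATES:
        params = dist.fit(data)               # MLE parameter estimates
        pdf = dist.pdf(centers, *params)
        sse = np.sum((hist - pdf) ** 2)       # lower is a better fit
        results.append((sse, dist.name, params))
    return sorted(results)                    # best-fitting distribution first

# Example: best_fit(preemption_runtimes_c10)[0] might return
# (sse, 'johnsonsb', (a, b, loc, scale)) for a cluster like C10.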

5.1 PDF Estimation on C10
Figure 10 shows the runtime distribution of preempted jobs on C10. As seen in the figure, Johnson SB is the best-fitting distribution. Among the top 10 clusters with the most preemptions in OSG, 5 clusters are best described by Johnson SB, 4 clusters are best fitted by the Inverted Weibull, and 1 cluster is best fitted by the Exponentiated Weibull.

Figure 10: PDF Estimation of Preemptions on C10

Figure 11 shows the distributions of MaxRetireTime and MaxKillTime on C10. As we expected, most of the pre-configured retire and kill times are concentrated around certain values: MaxRetireTimes are located around 15 hours and MaxKillTimes are located around 23 hours.

Figure 11: Distribution of MaxRetireTime and MaxKillTime on C10

Figure 12 and Figure 13 show the runtime distributions of Retire jobs and Kill jobs. As seen in the two figures, most Retire jobs' runtimes are around 15 hours and most Kill jobs' runtimes are around 23 hours. However, compared to Kill jobs, Retire jobs have longer tails in their distribution. This is because Retire jobs have been assigned a "soft deadline" (MaxRetireTime) and pilot jobs can continue to run their existing payload jobs until MaxKillTime is reached. In this scenario, MaxKillTime is considered a "hard deadline": all pilot jobs still running on the cluster at that point are immediately terminated. As a result, the Kill jobs' runtime distribution looks sharper than the Retire jobs'.

Figure 12: PDF Estimation of Retire Jobs on C10

We also estimate Retire and Kill jobs on other clusters. All 10 clusters with the most Retire jobs and Kill jobs are best fitted by Johnson-family distributions.

Figure 13: PDF Estimation of Kill Jobs on C10

We use the same method to estimate the PDFs of NetworkIssues and IdleShutDowns. We found that NetworkIssues are randomly distributed over runtime, while IdleShutDowns tend to be concentrated at small runtimes, since most of them are terminated at an early stage, as we discovered in Section 3. However, due to the limited number of these two types of jobs, we do not show the results in the paper.

With the PDFs of the different job classes, we further estimate the overall PDF of all pilot jobs on C10. Figure 14 shows the estimated distributions of the 5 job classes as well as the actual pilot-job distribution.
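The text does not spell out how the per-class PDFs are combined into the overall estimate. A natural reading is a mixture of the fitted class PDFs weighted by each class's share of job instances on the cluster, sketched below with hypothetical parameters:

# Sketch: combine per-class fitted PDFs into an overall runtime PDF for a
# cluster, as a mixture weighted by each class's fraction of job instances.
# The combination rule is our assumption, not a stated part of the paper.
import numpy as np
import scipy.stats as st

def overall_pdf(x, class_fits, class_counts):
    """class_fits: {label: (scipy distribution, fitted params)}
    class_counts: {label: number of instances of that class on the cluster}"""
    total = sum(class_counts.values())
    mix = np.zeros_like(np.asarray(x, dtype=float))
    for label, (dist, params) in class_fits.items():
        weight = class_counts[label] / total    # class prior on this cluster
        mix += weight * dist.pdf(x, *params)
    return mix

# Example with hypothetical fitted parameters (a, b, loc, scale):
# x = np.linspace(0, 24 * 3600, 500)
# pdf = overall_pdf(x,
#                   {"Preemption": (st.johnsonsb, (a1, b1, loc1, scale1)),
#                    "Retire": (st.johnsonsu, (a2, b2, loc2, scale2))},
#                   {"Preemption": 3000, "Retire": 7000})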

Figure 14: Estimation of Overall PDF on C10

6 CONCLUSIONS AND FUTURE WORK
This paper presents first results from a dataset that allows for analysis of the glideinWMS/HTCondor-based pilot system used on the OSG. In particular, we take a deep look at patterns of preemption for running pilots based on monitoring from the pilots themselves – we did not rely on "correct" site behavior for our work. The approach is important, as the plethora of site batch systems and configurations (not necessarily externally queryable) has historically precluded direct analysis of the system's behavior.

We illustrated the temporal and spatial properties of preemptions on cluster sites and also demonstrated the statistical relations between pre-configured system settings (a pilot's to-retire time and to-die time) and actual pilot terminations. Together, this information shows that it is possible to estimate the pilot runtime. We believe it will assist us in future scheduling of the payload jobs. In addition to maximizing total payload throughput, we believe this could be used to help guarantee progress on payloads by better matching payload length with remaining pilot runtime.

Additionally, in [15] and [13], the authors use time-series prediction models to predict job runtime in grids. These models can be generalized to different grids. We plan to try these models on the OSG and further improve our prediction performance.

Finally, we also need to collect more job data from OSG at different time periods so that we can verify whether the runtime distributions on a cluster are stationary across different time periods.

ACKNOWLEDGMENTS
This work was supported by NSF award PHY-1148698, via subaward from the University of Wisconsin-Madison. This research was done using resources provided by the Open Science Grid, which is supported by the National Science Foundation and the U.S. Department of Energy's Office of Science. This work was completed utilizing the Holland Computing Center of the University of Nebraska.

REFERENCES
[1] S. Bagnasco, L. Betev, P. Buncic, F. Carminati, C. Cirstoiu, C. Grigoras, A. Hayrapetyan, A. Harutyunyan, A. J. Peters, and P. Saiz. 2008. AliEn: ALICE environment on the GRID. Journal of Physics: Conference Series 119, 6 (2008), 062012. http://stacks.iop.org/1742-6596/119/i=6/a=062012
[2] Adrian Casajus, Ricardo Graciani, Stuart Paterson, Andrei Tsaregorodtsev, and the LHCb DIRAC Team. 2010. DIRAC pilot framework and the DIRAC Workload Management System. Journal of Physics: Conference Series 219, 6 (2010), 062049. http://stacks.iop.org/1742-6596/219/i=6/a=062049
[3] Amit Choudhury. 2005. A Simple Derivation of Moments of the Exponentiated Weibull Distribution. Metrika: International Journal for Theoretical and Applied Statistics 62, 1 (2005), 17–22. https://EconPapers.repec.org/RePEc:spr:metrik:v:62:y:2005:i:1:p:17-22
[4] Eduardo Huedo, Rubén S. Montero, and Ignacio M. Llorente. 2006. Evaluating the reliability of computational grids from the end user's point of view. Journal of Systems Architecture 52, 12 (2006), 727–736. https://doi.org/10.1016/j.sysarc.2006.04.003
[5] B. Jayatilaka, T. Levshina, M. Rynge, C. Sehgal, and M. Slyz. 2015. The OSG open facility: A sharing ecosystem. Journal of Physics: Conference Series 664, 3 (2015), 032016. http://stacks.iop.org/1742-6596/664/i=3/a=032016
[6] George Kola, Tevfik Kosar, and Miron Livny. 2005. Faults in Large Distributed Systems and What We Can Do About Them. In Proceedings of the 11th International Euro-Par Conference on Parallel Processing (Euro-Par'05). Springer-Verlag, Berlin, Heidelberg, 442–453.
[7] R. McKenna, S. Herbein, A. Moody, T. Gamblin, and M. Taufer. 2016. Machine Learning Predictions of Runtime and IO Traffic on High-End Clusters. In 2016 IEEE International Conference on Cluster Computing (CLUSTER). 255–258. https://doi.org/10.1109/CLUSTER.2016.58
[8] P. Nilsson, J. Caballero, K. De, T. Maeno, A. Stradling, T. Wenaus, and the ATLAS Collaboration. 2011. The ATLAS PanDA Pilot in Operation. Journal of Physics: Conference Series 331, 6 (2011), 062040. http://stacks.iop.org/1742-6596/331/i=6/a=062040
[9] Ruth Pordes, Don Petravick, Bill Kramer, Doug Olson, Miron Livny, Alain Roy, Paul Avery, Kent Blackburn, Torre Wenaus, Frank Würthwein, Ian Foster, Rob Gardner, Mike Wilde, Alan Blatecky, John McGee, and Rob Quick. 2007. The open science grid. Journal of Physics: Conference Series 78, 1 (2007), 012057. http://stacks.iop.org/1742-6596/78/i=1/a=012057
[10] R. Raman, M. Livny, and M. Solomon. 1998. Matchmaking: distributed resource management for high throughput computing. In Proceedings of the Seventh International Symposium on High Performance Distributed Computing (Cat. No.98TB100244). 140–146. https://doi.org/10.1109/HPDC.1998.709966
[11] I. Sfiligoi. 2008. glideinWMS – a generic pilot-based workload management system. Journal of Physics: Conference Series 119, 6 (2008), 062044. http://stacks.iop.org/1742-6596/119/i=6/a=062044
[12] I. Sfiligoi. 2014. Estimating job runtime for CMS analysis jobs. Journal of Physics: Conference Series 513, 3 (2014), 032087. http://stacks.iop.org/1742-6596/513/i=3/a=032087
[13] Ozan Sonmez, Nezih Yigitbasi, Alexandru Iosup, and Dick Epema. 2009. Trace-based Evaluation of Job Runtime and Queue Wait Time Predictions in Grids. In Proceedings of the 18th ACM International Symposium on High Performance Distributed Computing (HPDC '09). ACM, New York, NY, USA, 111–120. https://doi.org/10.1145/1551609.1551632
[14] Douglas Thain, Todd Tannenbaum, and Miron Livny. 2005. Distributed computing in practice: the Condor experience. Concurrency - Practice and Experience 17, 2-4 (2005), 323–356.
[15] Dan Tsafrir, Yoav Etsion, and Dror G. Feitelson. 2007. Backfilling Using System-Generated Predictions Rather Than User Runtime Estimates. IEEE Trans. Parallel Distrib. Syst. 18, 6 (June 2007), 789–803. https://doi.org/10.1109/TPDS.2007.70606