Monitoring HTCondor: A common one-stop solution?
Iain Steers - CERN IT
May 19, 2016 Monitoring HTCondor: A common one-stop solution? 2
Introduction
This presentation is going to cover the various ways of monitoring an HTCondor pool,
in an effort to solidify best practices and build up a community around a common monitoring solution.
A Brief History of Monitoring HTCondor
We’ve all sat down and thought: ”What about the CLIs and unix cron?”
Figure: One hour later: Why is RecentDaemonDutyCycle so high?
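A cron-driven check along these lines might parse the output of `condor_status -schedd -af Name RecentDaemonCoreDutyCycle` and alert on hot schedds. A minimal sketch (the 0.95 threshold, hostnames, and sample output are illustrative assumptions, not from the slides):

```python
def parse_duty_cycles(condor_status_output):
    """Parse 'Name RecentDaemonCoreDutyCycle' pairs from condor_status -af output."""
    cycles = {}
    for line in condor_status_output.splitlines():
        if not line.strip():
            continue
        name, value = line.rsplit(None, 1)
        cycles[name] = float(value)
    return cycles


def overloaded(cycles, threshold=0.95):
    """Return the schedds whose recent duty cycle exceeds the threshold."""
    return sorted(name for name, dc in cycles.items() if dc > threshold)


# Hypothetical sample output of:
#   condor_status -schedd -af Name RecentDaemonCoreDutyCycle
sample = """\
schedd01.example.org 0.42
schedd02.example.org 0.98
"""

print(overloaded(parse_duty_cycles(sample)))  # a cron job would mail this instead
```

This is roughly where most sites start, and exactly the kind of per-site script the rest of the talk argues against duplicating.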
A Slight Improvement
Autoformat to the rescue!
Figure: A bit better, but only something you want a human to use.
Requests from Management
Figure: SCHEDD_COLLECT_STATS_{BY,FOR} aggregate attributes from jobs matching expressions.
The Ganglia Integration
Push standard and custom metrics into a Ganglia instance using the GANGLIAD daemon.
Figure: Ganglia Web Front-end
Ganglia Competition?
GangliaD is great! But what about large sites which already have centralized monitoring solutions?
• ELK Stack
• InfluxDB and Grafana
• sysdig
MetricsD
Todd’s Talk of Lies from Barcelona mentioned an incoming MetricsD.
Same metric configuration language, but instead of just sending to Ganglia, it publishes JSON blobs with the monitoring samples.
MetricsD Example
[
Name = "Availability";
Value = int(ifThenElse(IsCritical is undefined,
(RecentDaemonCoreDutyCycle < .95) ||
(FileTransferFileReadLoad_5m < 2.0) ||
(FileTransferFileWriteLoad_1m < 2.0),
!IsCritical));
Desc = "Average availability of CE";
Scale = 100;
Units = "%";
TargetType = "Scheduler";
]
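The ifThenElse logic above can be mirrored in plain Python to see what a consumer of those JSON samples would compute for a given ad. A sketch under stated assumptions: the sample ads are made up, and `None` stands in for an undefined ClassAd attribute:

```python
def availability(ad):
    """Mirror the Availability metric expression for one schedd ad.

    If IsCritical is undefined, fall back to the health thresholds from the
    metric definition; otherwise availability is simply 'not critical'.
    """
    if ad.get("IsCritical") is None:  # ClassAd: IsCritical is undefined
        ok = (ad["RecentDaemonCoreDutyCycle"] < 0.95 or
              ad["FileTransferFileReadLoad_5m"] < 2.0 or
              ad["FileTransferFileWriteLoad_1m"] < 2.0)
    else:
        ok = not ad["IsCritical"]
    return int(ok) * 100  # Scale = 100, Units = "%"


# Hypothetical sample ads (attribute values invented for illustration).
healthy = {"RecentDaemonCoreDutyCycle": 0.40,
           "FileTransferFileReadLoad_5m": 0.1,
           "FileTransferFileWriteLoad_1m": 0.1}
critical = {"IsCritical": True}

print(availability(healthy), availability(critical))  # prints: 100 0
```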
CERN
As a site, CERN already has a centralized monitoring set-up based on Elasticsearch/Kibana, leaving the old GangliaD a bit redundant.
Pull out interesting metrics from ClassAds/jobs using the python-bindings.
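That extraction step might look like the sketch below: turn schedd ads into JSON documents for Elasticsearch. The collector query is shown only as a comment; plain dicts stand in for the ClassAds, and the attribute selection and document shape are assumptions for illustration:

```python
import json
import time

# With the HTCondor python-bindings the ads would be fetched roughly like:
#   import htcondor
#   ads = htcondor.Collector().query(htcondor.AdTypes.Schedd,
#                                    projection=["Name", "TotalRunningJobs",
#                                                "RecentDaemonCoreDutyCycle"])
# Here plain dicts stand in for the ClassAds so the sketch is self-contained.


def ads_to_documents(ads, timestamp=None):
    """Turn schedd ads into JSON documents suitable for indexing."""
    timestamp = timestamp or int(time.time())
    docs = []
    for ad in ads:
        docs.append(json.dumps({
            "timestamp": timestamp,
            "schedd": ad["Name"],
            "running_jobs": ad.get("TotalRunningJobs", 0),
            "duty_cycle": ad.get("RecentDaemonCoreDutyCycle", 0.0),
        }))
    return docs


# Hypothetical sample ad.
sample_ads = [{"Name": "schedd01.example.org",
               "TotalRunningJobs": 1234,
               "RecentDaemonCoreDutyCycle": 0.42}]
for doc in ads_to_documents(sample_ads, timestamp=1463616000):
    print(doc)
```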
So we’ve found ourselves using a mix of existing solutions.
Our Main Dashboard
Figure: The main pool health dashboard
Our Draining Dashboard
Figure: Track the multi-core draining of the pool and wasted CPU over time.
Our Cgroups Monitoring
Figure: Cgroups Monitoring of Jobs
Too Many Dashboards
Too many dashboards and different systems (ELK/InfluxDB/Grafana/Spark/Jupyter).
#MonitoringSucks
The Problem
This doesn’t seem to be a problem just at CERN.
Looking at the HTCondor community, it seems everyone has done their own thing for monitoring,
siloing knowledge, implementations, and best practices.
Conclusion
Could we do better?
Could we bring together a community around monitoring HTCondor?
Get the Python and HTCondor experts and data junkies in a (virtual) room together and come up with a common platform to really reveal the health of your pool?
Questions?
Any Questions?