19

Monitoring HTCondor: A common one-stop solution?

  • Upload
    lengoc

  • View
    221

  • Download
    2

Embed Size (px)

Citation preview

Page 1: Monitoring HTCondor: A common one-stop solution?
Page 2: Monitoring HTCondor: A common one-stop solution?

Monitoring HTCondor: A commonone-stop solution?Iain Steers - CERN IT

May 19, 2016 Monitoring HTCondor: A common one-stop solution? 2

Page 3: Monitoring HTCondor: A common one-stop solution?

Introduction

This presentation is going to cover the variousdifferent ways of monitoring a HTCondor pool.

In an effort to solidify the best practices and buildup a community around a common monitoringsolution.

May 19, 2016 Monitoring HTCondor: A common one-stop solution? 3

Page 4: Monitoring HTCondor: A common one-stop solution?

A Brief History of MonitoringHTCondor

We’ve all sat down and thought - ”What about theclis and unix cron?”.

Figure: One Hour Later : Why is RecentDaemonDutyCycle so high?

May 19, 2016 Monitoring HTCondor: A common one-stop solution? 4

Page 5: Monitoring HTCondor: A common one-stop solution?

A Slight Improvement

Autoformat to the rescue!

Figure: A bit better, but only something you want a human to use.

May 19, 2016 Monitoring HTCondor: A common one-stop solution? 5

Page 6: Monitoring HTCondor: A common one-stop solution?

Requests from Management

Figure: SCHEDD COLLECT STATS BY {BY,FOR} aggregate attributesfrom job matching expressions.

May 19, 2016 Monitoring HTCondor: A common one-stop solution? 6

Page 7: Monitoring HTCondor: A common one-stop solution?

The Ganglia IntegrationPush standard and custom metrics into a gangliainstance using the daemon GANGLIAD.

Figure: Ganglia Web Front-end

May 19, 2016 Monitoring HTCondor: A common one-stop solution? 7

Page 8: Monitoring HTCondor: A common one-stop solution?

Ganglia Competition?

GangliaD is great!But what about large sites which already havecentralized monitoring solutions?

• ELK Stack

• Influxdb and Grafana

• sysdig

May 19, 2016 Monitoring HTCondor: A common one-stop solution? 8

Page 9: Monitoring HTCondor: A common one-stop solution?

MetricsD

Todd’s Talk of Lies from Barcelona mentioned anincoming MetricsD.

Same metric configuration language but instead ofjust sending to ganglia, publishes json blobs withthe monitoring samples.

May 19, 2016 Monitoring HTCondor: A common one-stop solution? 9

Page 10: Monitoring HTCondor: A common one-stop solution?

MetricsD Example

[

Name = "Availability";

Value = int(ifThenElse(IsCritical is undefined,

(RecentDaemonCoreDutyCycle < .95) ||

(FileTransferFileReadLoad_5m < 2.0) ||

(FileTransferFileWriteLoad_1m < 2.0),

!IsCritical));

Desc = "Average availability of CE";

Scale = 100;

Units = "%";

TargetType = "Scheduler";

]

May 19, 2016 Monitoring HTCondor: A common one-stop solution?10

Page 11: Monitoring HTCondor: A common one-stop solution?

CERN

As a site, CERN already has a centralizedmonitoring set-up, based on ElasticSearch/Kibana.Leaving the old gangliad a bit redundant.

Pull out interesting metrics from classads/jobs usingpython-bindings.

So we’ve found ourselves using a mix of existingsolutions.

May 19, 2016 Monitoring HTCondor: A common one-stop solution?11

Page 12: Monitoring HTCondor: A common one-stop solution?

Our Main Dashboard

Figure: The main pool health dashboard

May 19, 2016 Monitoring HTCondor: A common one-stop solution?12

Page 13: Monitoring HTCondor: A common one-stop solution?

Our Draining Dashboard

Figure: Track the multi-core draining of the pool and wasted cpu overtime.

May 19, 2016 Monitoring HTCondor: A common one-stop solution?13

Page 14: Monitoring HTCondor: A common one-stop solution?

Our Cgroups Monitoring

Figure: Cgroups Monitoring of Jobs

May 19, 2016 Monitoring HTCondor: A common one-stop solution?14

Page 15: Monitoring HTCondor: A common one-stop solution?

Too Many Dashboards

Too Many Dashboards and different Systems.(ELK/Influx/Grafana/Spark/Jupyter)

#MonitoringSucks

May 19, 2016 Monitoring HTCondor: A common one-stop solution?15

Page 16: Monitoring HTCondor: A common one-stop solution?

The Problem

This doesn’t seem to be a problem just at CERN.

Looking at the HTCondor community it seemseveryone has done their own thing for monitoring.

Siloing knowledge, implementations and bestpractices.

May 19, 2016 Monitoring HTCondor: A common one-stop solution?16

Page 17: Monitoring HTCondor: A common one-stop solution?

Conclusion

Could we do better?

Could we bring together a community aroundmonitoring HTCondor?

Get the Python, HTCondor experts and data junkiesin a (virtual-)room together and come up with acommon platform, to really reveal the health of yourpool?

May 19, 2016 Monitoring HTCondor: A common one-stop solution?17

Page 18: Monitoring HTCondor: A common one-stop solution?

Questions?

Any Questions?

May 19, 2016 Monitoring HTCondor: A common one-stop solution?18

Page 19: Monitoring HTCondor: A common one-stop solution?