
Solving Grid problems through glidein monitoring


This document presents how Glidein Factory operations help solve problems that develop on Grid resources.


Page 1: Solving Grid problems through glidein monitoring


glideinWMS training

Solving Grid problems through glidein monitoring

i.e. The Grid debugging part of G.Factory operations

by Igor Sfiligoi (UCSD)

Page 2: Solving Grid problems through glidein monitoring


Glidein Factory Operations

● Factory node operations
  ● Serving VO Frontend Admin requests
  ● Keeping up with changes in the Grid

● Debugging Grid problems
  ● The most time consuming part
  ● Effectively, we help solve Grid problems through glidein monitoring

Page 3: Solving Grid problems through glidein monitoring


Reminder - Glideins

● A glidein is a properly configured Condor startd daemon submitted as a Grid job

[Architecture diagram: the VO Frontend monitors the Condor pool and requests glideins from the Factory; the Factory submits glideins to the site CE; each glidein starts a Startd on a Worker node, which registers with the Central manager and is matched to user Jobs from the Submit node.]

Page 4: Solving Grid problems through glidein monitoring


What can go wrong in the Grid?

● Many places where things can go wrong
  ● Essentially at any of the arrows in the architecture diagram above


Page 5: Solving Grid problems through glidein monitoring


What can go wrong in the Grid?

● In particular:
  ● The CE may refuse to accept glideins


Page 6: Solving Grid problems through glidein monitoring


What can go wrong in the Grid?

● In particular:
  ● The CE may not start glideins
  ● Or it may fail to tell us what the status of the job is


Page 7: Solving Grid problems through glidein monitoring


What can go wrong in the Grid?

● In particular:
  ● The worker node may be broken/misconfigured
    – Thus validation will fail
  ● This can happen for many reasons


Page 8: Solving Grid problems through glidein monitoring


What can go wrong in the Grid?

● In particular:
  ● The WAN networking may not work properly
    – The CM never hears from the Startd
    – Or the Schedd cannot talk to the Startd
  ● The breakage can be selective


Page 9: Solving Grid problems through glidein monitoring


What can go wrong in the Grid?

● In particular:
  ● The security infrastructure could be broken
    – CAs missing
    – Time discrepancies
    – Etc.


Page 10: Solving Grid problems through glidein monitoring


What can go wrong in the Grid?

● In particular:
  ● The site may refuse to start the user job
    – e.g. via gLExec


Page 11: Solving Grid problems through glidein monitoring


What can go wrong with glideins?

● And there are also non-Grid problems
  ● e.g. jobs not matching
● But that's beyond the scope of this document


Page 12: Solving Grid problems through glidein monitoring


Problem classification

● Most often we see WN problems
  ● Followed by CEs refusing glideins
  ● Both typically easy to diagnose
● Then there are misbehaving CEs
  ● Very hard to diagnose!
● Everything else is quite rare
  ● But usually hard to diagnose as well

Page 13: Solving Grid problems through glidein monitoring


Grid debugging

Validation problems, i.e. problems on Worker Nodes

Page 14: Solving Grid problems through glidein monitoring


WN problems

● The glidein startup script runs a list of validation scripts
  ● If any of them fails, the WN is considered broken
  ● This way user jobs never get to broken WNs
● Two sources of tests:
  ● Glidein Factory
  ● VO Frontend
● Of course, if a validation script cannot be fetched from either Web server, that is considered a failure as well

Page 15: Solving Grid problems through glidein monitoring


Types of tests

● The glideinWMS SW comes with a set of standard tests (provided by the factory):
  ● Grid environment present (e.g. CAs)
  ● Some free disk on $PWD and on /tmp
  ● Enough FE-provided proxy lifetime remaining
  ● gLExec related tests
  ● OS type
● Each VO may have its own needs, e.g.:
  ● Is the VO SW pre-installed and accessible?
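To make this concrete, below is a minimal sketch of what one such test could look like, written in Python for readability. It is illustrative only: the real glideinWMS validation scripts are fetched from the factory/frontend Web servers and follow their own conventions, and the 64 MB threshold and error-message format here are assumptions.

```python
#!/usr/bin/env python
# Hypothetical validation test: check free disk space on $PWD and /tmp,
# mirroring the standard "some free disk" test listed above.
# Convention assumed: exit 0 -> WN looks healthy; non-zero -> the WN is
# considered broken and no user job is matched to this glidein.
import os
import sys

MIN_FREE_MB = 64  # assumed threshold, not the glideinWMS default

def free_mb(path):
    """Return the free space of the filesystem holding 'path', in MB."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize / (1024 * 1024)

for path in (os.getcwd(), "/tmp"):
    mb = free_mb(path)
    if mb < MIN_FREE_MB:
        # Anything printed here is shipped back to the factory with the
        # glidein output, so make the message self-explanatory.
        sys.stderr.write("VALIDATION ERROR: only %.1f MB free on %s "
                         "(need %d MB)\n" % (mb, path, MIN_FREE_MB))
        sys.exit(1)

sys.exit(0)
```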

Page 16: Solving Grid problems through glidein monitoring


Discovering the problems

● Any error message printed by a validation script is delivered back to the factory
  ● After the glidein terminates
● Most validation scripts provide a clear indication of what went wrong
  ● And we strive to get all of them to do it!
● A new machine-readable format is being introduced
  ● Starting with v2_6_2

Page 17: Solving Grid problems through glidein monitoring


Typical ops

● Noticing that a large fraction of glideins for a site is failing is easy
  ● Just look at the monitoring
  ● And we are getting a daily email as well
● Discovering what exactly is broken is not too difficult either, with appropriate tools
  ● Just parse the logs (see the sketch below)
  ● Will get even easier when all scripts return machine-readable information
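As an illustration of "parsing the logs", here is a small sketch that tallies validation errors per factory entry. The directory layout and the "VALIDATION ERROR:" marker are assumptions (carried over from the validation-test sketch above), not actual glideinWMS conventions; the machine-readable format mentioned above would make this far more robust.

```python
# Hypothetical log scan: tally validation error messages per site from
# glidein stderr files, so the most common failure modes stand out.
import collections
import glob
import re

LOG_GLOB = "/var/log/gfactory/entry_*/job.*.err"  # assumed layout
ERR_RE = re.compile(r"VALIDATION ERROR: (.*)")

counts = collections.Counter()
for fname in glob.glob(LOG_GLOB):
    entry = fname.split("/")[-2]  # the entry_SITENAME directory
    with open(fname) as f:
        for line in f:
            m = ERR_RE.search(line)
            if m:
                counts[(entry, m.group(1).strip())] += 1

# Print the 20 most frequent (site, error) pairs.
for (entry, msg), n in counts.most_common(20):
    print("%5d  %-30s %s" % (n, entry, msg))
```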

Page 18: Solving Grid problems through glidein monitoring


Action items

● Not much we can do directly, unless this is the result of a misconfiguration on our part
  ● Typically, we open a ticket with the site
  ● Provide the list of nodes where it happens (it is rare for the whole site to be broken)
  ● A concise but complete error report is essential for a speedy resolution
● In a minority of cases we have to contact the VO FE admin instead, e.g. for:
  ● Unclear error messages
  ● Non-WN-specific validation errors

Page 19: Solving Grid problems through glidein monitoring


Black hole nodes

● There is one further WN problem: black hole WNs
  ● WNs that accept glidein jobs, but don't execute them
● glidein_startup never has the chance to log anything
  ● Not even the node it is running on
  ● Thus, empty log files!
● We can infer we have a black hole node at a site by looking at job timing in the Condor-G logs (see the sketch below)
  ● Good jobs run for at least 20 mins
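A rough sketch of that timing heuristic: scan a Condor-G user log for "executing" (event 001) and "terminated" (event 005) pairs and flag glideins that ran for less than 20 minutes. The event-line layout matched below is the classic Condor user-log format; the log file name is a placeholder.

```python
# Heuristic black-hole detection: flag glideins whose wall time between
# the "executing" (001) and "terminated" (005) events in the Condor-G
# user log is suspiciously short. Healthy glideins run for >= 20 min.
import re
from datetime import datetime

USER_LOG = "condor_activity.log"  # hypothetical file name
# Classic user-log event line: "CODE (cluster.proc.sub) MM/DD HH:MM:SS ..."
EVENT_RE = re.compile(r"^(\d{3}) \((\d+\.\d+)\.\d+\) (\d\d/\d\d \d\d:\d\d:\d\d)")
MIN_GOOD = 20 * 60  # seconds

start = {}
with open(USER_LOG) as f:
    for line in f:
        m = EVENT_RE.match(line)
        if not m:
            continue
        code, job, stamp = m.groups()
        # The user log carries no year, so this breaks across New Year;
        # good enough for a day-to-day ops heuristic.
        t = datetime.strptime(stamp, "%m/%d %H:%M:%S")
        if code == "001":                       # job started executing
            start[job] = t
        elif code == "005" and job in start:    # job terminated
            runtime = (t - start.pop(job)).total_seconds()
            if runtime < MIN_GOOD:
                print("suspect black hole: job %s ran only %ds" % (job, runtime))
```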

Page 20: Solving Grid problems through glidein monitoring


Grid debugging

CE refusing the glideins

Page 21: Solving Grid problems through glidein monitoring


CE Refusing the glideins

● A CE admin has the right to refuse anyone
  ● But usually does not change his mind overnight
  ● First-time access to a site is an issue of its own
    – Not covered here
● When things go wrong, the typical reasons are:
  ● CE service down
  ● Problems in the Security/Auth infrastructure
  ● CE seriously misconfigured/broken

Page 22: Solving Grid problems through glidein monitoring


Expected vs Unexpected

● Some “problems” are expected
  ● e.g. the CE is down for scheduled maintenance
  ● Nothing to do in this case!
    – Just a monitoring issue
  ● So, checking the maintenance DB is important!
● If the problem is not expected, we have to notify the site
  ● The VO FEs are not getting the CPU slots they are asking for

Page 23: Solving Grid problems through glidein monitoring


Discovering the problem

● Condor-G reacts in two different ways
  ● It does nothing – we still have monitoring showing the job did not progress from Waiting→Pending
  ● Or it puts the job on Hold
● The G.Factory will react on Held jobs (see the sketch below)
  ● Releasing them a few times → Condor-G retries
  ● Removing them after a while
    – Just to be replaced with identical glideins
● For most non-trivial problems, the problem does not solve by itself
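A sketch of that release-then-remove policy, driven by the standard condor_q/condor_release/condor_rm tools. JobStatus == 5 (Held) and NumSystemHolds are real Condor job attributes, and -af is condor_q's autoformat option in recent Condor versions; the retry limit of 3 is an assumption, not the G.Factory's actual policy.

```python
# Sketch: release held glidein jobs a few times, then give up and
# remove them (the factory will submit fresh glideins anyway).
import subprocess

MAX_RELEASES = 3  # assumed retry limit

# JobStatus == 5 means Held; NumSystemHolds counts how often the job
# has been held so far.
out = subprocess.check_output(
    ["condor_q", "-constraint", "JobStatus == 5",
     "-af", "ClusterId", "ProcId", "NumSystemHolds", "HoldReason"])

for line in out.decode().splitlines():
    cluster, proc, holds, reason = line.split(None, 3)
    job_id = "%s.%s" % (cluster, proc)
    if int(holds) < MAX_RELEASES:
        print("releasing %s (held %s times): %s" % (job_id, holds, reason))
        subprocess.call(["condor_release", job_id])
    else:
        print("removing %s after %s holds: %s" % (job_id, holds, reason))
        subprocess.call(["condor_rm", job_id])
```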

Page 24: Solving Grid problems through glidein monitoring


Action items (for unexpected problems)

● Most of the time, not much we can do directly
  ● Will just open a ticket with the site
  ● If there is any useful info in the HoldReason, we pass it on
  ● The DN of the proxy is the most valuable info
● But it could be our problem, too
  ● We have found many Condor-G problems in the past
  ● Comparing the behavior of many G.Factory instances can confirm or exclude this
    – Ad-hoc solutions are needed if this is the case

Page 25: Solving Grid problems through glidein monitoring


Grid debugging

CE not properly handling the glideins

Page 26: Solving Grid problems through glidein monitoring


Problematic CE

● Three basic types of problems:
  ● Glideins not starting
  ● Improper monitoring information
  ● Output files not being delivered to the client
● And there are two more:
  ● Unexpected site policies that kill glideins
  ● Preemption by the site

Page 27: Solving Grid problems through glidein monitoring


Glideins not starting

● The CE scheduling policy is not available to us
  ● So it is often not obvious whether we are just low priority or something else is going on
  ● GF/Condor-G does not see it as an error condition
● We usually don't act on it, unless
  ● The VO FE admin complains, or
  ● We have been given explicit guidance on the expected startup rates
● Not much for us to investigate
  ● Just tell the site admin “Jobs are not starting”

Page 28: Solving Grid problems through glidein monitoring


Glideins being killed by the site

● Ideally, our glideins should fit within the policies of the site
  ● But sometimes they don't
  ● So they get killed hard
● Discovering this from our side is very hard
  ● We often just notice empty log files
  ● Not an error for Condor-G
  ● We often learn of this because the VO complains
● If and when we understand the problem, we can deal with it ourselves
  ● i.e. we configure the glideins to stay within the limits
    – But getting this info is not trivial, remember?

Page 29: Solving Grid problems through glidein monitoring


Preemption

● Some sites will preempt our glideins if higher-priority jobs get into the queue
  ● Effectively killing our glideins
● Not an actual error
  ● Sites have the right to do it!
● But it can mess with our monitoring/ops
  ● We may see killed glideins, or
  ● We may see glideins that seem to run for a very long time (when automatically rescheduled on the CE)
● We have to efficiently filter these events out

Page 30: Solving Grid problems through glidein monitoring


Improper monitoring info from CE

● A CE may not provide reliable information
  ● Each VO FE provides us with monitoring information about its central manager
  ● By comparing what it tells us with what the CE tells us, we can infer if there are problems (see the sketch below)
  ● A large, consistent discrepancy typically signals problems in the CE monitoring
● Very difficult to figure out what is going on
  ● We have no direct, detailed data to act upon
  ● Mostly ad-hoc detective work, prodding the black box
  ● Often inconclusive
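A toy version of that cross-check: compare the number of glidein startds the VO collector actually sees for a site with the number of glideins Condor-G believes are running there. GLIDEIN_Site is an attribute glideins typically publish, but the pool name, site name, and the 20% threshold below are all illustrative assumptions.

```python
# Toy cross-check: startds the VO collector sees for a site vs. jobs
# Condor-G thinks are running there. A large, persistent discrepancy
# hints at bad CE monitoring info.
import subprocess

SITE = "SomeSite"                   # hypothetical site name
POOL = "vo-collector.example.org"   # hypothetical VO collector

# Startds that actually phoned home to the VO central manager.
startds = subprocess.check_output(
    ["condor_status", "-pool", POOL, "-constraint",
     'GLIDEIN_Site == "%s"' % SITE, "-af", "Name"]).decode().splitlines()

# Glidein jobs Condor-G reports as running (JobStatus == 2) at the site.
running = subprocess.check_output(
    ["condor_q", "-constraint",
     'JobStatus == 2 && regexp("%s", GridResource)' % SITE,
     "-af", "ClusterId"]).decode().splitlines()

print("collector sees %d startds, Condor-G sees %d running glideins"
      % (len(startds), len(running)))
if running and len(startds) < 0.8 * len(running):  # assumed threshold
    print("WARNING: large discrepancy - suspect CE monitoring")
```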

Page 31: Solving Grid problems through glidein monitoring


Lack of output files

● The glidein output files contain:
  ● Accounting information
  ● Detailed logging
● Without other problems, mostly an annoyance
  ● But much more often paired with glideins failing
● Making failure diagnostics close to impossible
  ● Extremely hard to diagnose the root cause
  ● Sometimes we may infer it (black holes, killed glideins, ...)
  ● For actual CE problems it requires help from many parties, including us, the site admins and the SW developers

Page 32: Solving Grid problems through glidein monitoring


Grid debugging

Networking problems

Page 33: Solving Grid problems through glidein monitoring


Glideins are network heavy

● Each glidein opens several long-lived TCP connections (in CCB mode)
● These can overwhelm networking gear
  – e.g. NATs can run out of spare ports: 10k glideins holding 3 connections each occupy ~30k NAT entries, a large fraction of the ~64k ports available on a single NAT address
● Problems can have non-linear behavior
  ● Will work fine on a small scale
  ● Will degrade after a while
    – Not necessarily as a step function, though
● Although straight-out denials due to firewalls are also a problem

Page 34: Solving Grid problems through glidein monitoring


Diagnostics and action items

● Not trivial to detect
  ● The errors are often in the glidein logs
  ● But difficult to interpret
  ● And we are lacking tools for automatically detecting this
● Not much we can do directly
  ● It is a problem between the VO services and the site
    – So we notify both
● However, we usually end up assisting as experts

Page 35: Solving Grid problems through glidein monitoring


Grid debugging

Authentication problems

Page 36: Solving Grid problems through glidein monitoring


Security is delicate stuff

● Grid security mechanisms are paranoid by design
  ● “Availability” is the last thing to be considered
  ● The main focus is keeping the “bad guys” out
● So they are extremely delicate
  ● If any piece of the chain breaks, everything breaks
● Things that can go wrong (non-exhaustive list):
  ● Missing CA(s)
  ● Expired CRLs
  ● Expired glidein proxy
  ● Wrong system time (clock skew)
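Several of these are cheap to check from the worker node itself, which is why they also appear among the standard validation tests. The sketch below uses grid-proxy-info (the standard Globus tool) and an HTTP Date header as a crude time reference; the thresholds and the time-source URL are assumptions.

```python
# Quick security sanity checks of the kind listed above, suitable as
# glidein validation tests. Thresholds and the time source are assumed.
import os
import subprocess
import sys
import time
import email.utils
try:
    from urllib.request import urlopen   # Python 3
except ImportError:
    from urllib2 import urlopen          # Python 2

# 1. Enough lifetime left on the glidein proxy?
timeleft = int(subprocess.check_output(["grid-proxy-info", "-timeleft"]))
if timeleft < 3600:  # assumed one-hour minimum
    sys.exit("VALIDATION ERROR: proxy expires in %d s" % timeleft)

# 2. Are the CA certificates where the middleware expects them?
cadir = os.environ.get("X509_CERT_DIR", "/etc/grid-security/certificates")
if not os.path.isdir(cadir):
    sys.exit("VALIDATION ERROR: CA directory %s is missing" % cadir)

# 3. Crude clock-skew check against a Web server's Date header.
resp = urlopen("http://www.example.org/")   # hypothetical time source
remote = email.utils.mktime_tz(
    email.utils.parsedate_tz(resp.headers["Date"]))
skew = abs(remote - time.time())
if skew > 300:  # assumed 5-minute tolerance
    sys.exit("VALIDATION ERROR: system clock off by %d s" % skew)
```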

Page 37: Solving Grid problems through glidein monitoring


Diagnostics and action items

● Finding the root cause is usually hard
  ● The errors are in the glidein logs
  ● But they usually do not provide enough info (to avoid giving up too much info to a hypothetical attacker)
  ● And we are lacking tools for automatically detecting this
● Have to distinguish between site problems and VO problems, too
  ● Only obvious if only a fraction fails (→ WN problem)
  ● Else, may need to get both sides involved to properly diagnose the root cause

Page 38: Solving Grid problems through glidein monitoring


Grid debugging

Job startup problems

Page 39: Solving Grid problems through glidein monitoring


gLExec (1)

● The biggest source of problems, by far, is gLExec refusing to accept a user proxy
  ● Resulting in jobs not starting
  ● BTW, Condor is not good at handling gLExec denials
● We can only partially test gLExec during validation (see the sketch below)
  ● It may behave differently based on the proxy used
  ● Its behavior can change over time
● And final users may be the source of the problem
  ● e.g. by letting the proxy expire
    – Condor could catch these, and hopefully soon will
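For illustration, a partial gLExec probe of the kind a validation script can run: ask gLExec to execute a trivial command under the identity mapped from a test proxy. GLEXEC_CLIENT_CERT and GLEXEC_SOURCE_PROXY are real gLExec environment variables; the binary location and proxy path are placeholders, and, as noted above, success here only proves that this particular proxy is accepted right now.

```python
# Sketch of a partial gLExec validation test: run a trivial command
# under the identity gLExec maps a given user proxy to.
import os
import subprocess
import sys

GLEXEC = "/usr/sbin/glexec"               # assumed install location
USER_PROXY = "/tmp/test_user_proxy.pem"   # hypothetical test proxy

env = dict(os.environ)
env["GLEXEC_CLIENT_CERT"] = USER_PROXY    # identity to map to
env["GLEXEC_SOURCE_PROXY"] = USER_PROXY   # proxy handed to the payload

ret = subprocess.call([GLEXEC, "/usr/bin/id"], env=env)
if ret != 0:
    # This only proves gLExec works for THIS proxy, right now; it may
    # still refuse other users' proxies, or change behavior later.
    sys.exit("VALIDATION ERROR: glexec denied test proxy (rc=%d)" % ret)
```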

Page 40: Solving Grid problems through glidein monitoring


gLExec (2)

● Non-trivial to detect
  ● The errors are in the glidein logs
  ● But we miss the tools to extract them
● Finding the root cause is impossible without site admin help
  ● gLExec policies are a site secret
  ● We thus just notify the site, providing the failing user's DN

Page 41: Solving Grid problems through glidein monitoring


Configuration problems

● Condor can be configured to run a wrapper around the user job
  ● To customize the user environment
  ● Usually provided by the VO FE
● If that fails, the user job fails with it
  ● Luckily, failures are rare
● If we notice them, we notify the VO FE admins
  ● However, they often notice before we do

Page 42: Solving Grid problems through glidein monitoring


Other job startup problems

● By default, we validate the node only at glidein startup
  ● WN conditions may change by the time a job is scheduled to run
    – e.g. the disk fills up
● The errors are usually only seen by the final users
  ● So we hardly ever notice these kinds of problems
● We should do better: Condor supports periodic validation tests, we just don't use them right now (see the config sketch below)
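The Condor mechanism alluded to here is startd cron. A sketch of the relevant configuration follows; the STARTD_CRON knob names are real Condor configuration parameters, while the job name, script path, period, and published attribute are assumptions for illustration.

```
# condor_config sketch: run a periodic validation script inside the
# glidein startd and refuse to start new jobs if it reports a problem.
STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) WNCHECK
STARTD_CRON_WNCHECK_EXECUTABLE = /path/to/periodic_validate.sh
STARTD_CRON_WNCHECK_PERIOD = 15m

# periodic_validate.sh prints e.g. "WN_OK = True" on its stdout; the
# attribute is merged into the machine ad, so START can gate on it.
START = $(START) && (WN_OK =?= True)
```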

Page 43: Solving Grid problems through glidein monitoring


Summary

● The Grid world is a good approximation of a chaotic system
  ● There are thus many failure modes
● The pilot paradigm hides most of the failures from the final users
  ● But the failures are still there
  ● Resulting in wasted/underused CPU cycles
● The G.Factory operators are in the best position to diagnose the root cause of the failures
  ● By having a global view
  ● However, they cannot solve the problems by themselves

Page 44: Solving Grid problems through glidein monitoring


Acknowledgments

● This document was sponsored by grants from the US NSF and US DOE, and by the UC system