Strategy and Tools to support the exploitation model OP-Centric View

Strategy and Tools to support the exploitation modelOP-Centric View

Vito Baggiolini

with input from: M. Arruat, J-C. Bau, L. Burdzanowski, M. Buttner, S. Deghaye,

F. Ehm, R. Gorbonosov, G. Kruk, J. Lauener, M. Pace, and others

2

Strategic Goals for 2014

• Enable/motivate OP to do first line diagnosis and find the right expert to call

• Get CO project teams to contribute to first line diagnostic tools

• Prevent failures => less need for 1st line diagnostics(not covered in this presentation)

12/12/2013

3

Outline

• Overview of basic diagnostic use cases

• Tools available and possible improvements for each case

• Tools for more advanced diagnosis

• Further aspects and a summary

12/12/2013

4

Different 1st Line Diagnostic Tools for different Cases

• Case 1: Failures in request/reply systems – operator clicks on a button, expects a result but gets an error or a

timeout– E.g. InCA WorkingSet, LSA Applications, Sequencer, OASIS, …

• Case 2: Alarms in LASER, Red entries in DiaMon• Case 3: Failures in monitoring-based systems

– E.g. Logging system, PostMortem Analysis, Software Interlock, timing, FixedDisplays, …

– Often background processes without GUI– Failure not immediately visible to OP– OP discovers error later in some place not directly related to fault

12/12/2013

Case 1: Failures in Request/Reply systems

6

Case 1: Request Failures seen in GUIs

• Goal: OP should not just call the responsible of the GUI……But identify the system that caused the error and call the responsible expert

• Some integrated diagnostic tools are available (or promised)– WorkingSet/Knob diagnostics– InCA Server diagnostics (promised for April)– “Why-no-permit” functionality in Software Interlock System– JAPC Monitoring Diagnostic GUI– …

12/12/2013

7

InCA Working Set - Diagnostic tools

12/12/2013

8

InCA Working Set - Diagnostic toolbox

12/12/2013

9

InCA Working Set – JAPC/RDA diagnostics

12/12/2013

10

Case 1: Request Failures seen in GUIs

• OP should not just call the responsible of the GUI• But identify the system that caused the error and call the

corresponding expert• Integrated diagnostic tools are available (or promised)

– WorkingSet/Knob diagnostics– InCA Server diagnostics – “Why” functionality in Software Interlock System– JAPC Monitoring Diagnostic GUI– …

• Good error messages can be very useful for OP (c.f. e-logbook)– Bad ones are not even useful for the expert

12/12/2013

11

Examples of Good (because clear) error messages

Exception: [cern.japc.ParameterException] "Server 'BQBBQLHC.cfv-ua47-bqpll' is down or unreachable"

Problem loading the settings for the TCTs: TCTVA.4L2.B1: java.lang.Exception: Rejected, first point of profile different from current position!

asynchronous operation on LHC.OFSU/[email protected] failedcern.japc.ParameterException: Server overloaded and not able to handle all requests, will loose updates.

12/12/2013

12

One long but good (because complete) error message

LHC OP logbook [Tuesday 04-Sep-2012 Morning]2012/09/04 13:45:57 DRIVE ABORT GAP CLEANING SETTING [ERROR] DriveException [Failed to drive some parameters AbortGapCleanVB1/CleaningStrength: cern.cmw.AccessDenied: RBAC Exception: Authorization access denied because there is no matching access rule. Transaction: [SET on AbortGapCleanVB1#CleaningStrength] Token [serial=0x71eb758d; user=lhcop; authTime=2012.09.04 08:04:22; endTime=2012.09.04 16:03:22; application=LHC Sequencer GUI;location=CCC-LHC; locationAddress=172.18.200.124; roles=[DIAMON-Operator, SeqLhcOperator, SeqHwcOperator, SeqSpsOperator, LHC-Operator, OP-Daemon, SPS-Operator]] RDA Info [clientHost=cs-ccr-lsa1.cern.ch; remoteUser=copera; serverName=ALLAbortGapClean.cfv-sr4-adtvb1] RBAC Setup [policy=strict, mode="OPERATIONAL"]

12/12/2013

13

RBAC Error message (cont)

RBAC Access Map excerpt for [SET on AbortGapCleanVB1#CleaningStrength]: Class Property Device Role Application Location Mode Operation ALLAbortGapClean CleaningStrength AbortGapCleanVB1 RF-Expert * RF-864-Control-Room * SET ALLAbortGapClean CleaningStrength AbortGapCleanVB1 RF-Expert * RF-SR4-Control-Room * SET ALLAbortGapClean CleaningStrength AbortGapCleanVB1 RF-Expert * * NON-OPERATIONAL SET ALLAbortGapClean CleaningStrength AbortGapCleanVB1 * * RF-FrontEnds * SET ALLAbortGapClean CleaningStrength AbortGapCleanVB1 * * RF-Servers * SET ALLAbortGapClean CleaningStrength AbortGapCleanVB1 RF-LHC-Piquet * * * SET ALLAbortGapClean CleaningStrength AbortGapCleanVB1 RF-Expert * CCC-LHC * SET

Clearly a team that got too many spurious support requests ;-)

12/12/2013

14

Ambiguous Error Message

LHC OP [Tuesday 04-Sep-2012 Afternoon]20:35:01 - 2012-09-04 20:35:01,171 [JapcExecutor-thread-31] ERROR JapcDispatcherImpl +==> Exception occured for parameter id [HC.BLM.SR5.R_ENERGY_ACQ] name [HC.BLM.SR5.R/SISMonitoring]asynchronous operation on HC.BLM.SR5.R/SISMonitoring@ failed cern.japc.ParameterException: Notification failure. at cern.japc.ext.cmwrda.DataTranslator.createParameterException(DataTranslator.java:269) at cern.japc.ext.cmwrda.RDAReplyHandler.handleError(RDAReplyHandler.java:115) at cern.cmw.rda.client.ReplyAdapter.handleException(ReplyAdapter.java:88) at cern.cmw.rda.client.ServerConnection.handleReplyPacket(ServerConnection.java:2622) at cern.cmw.corba.rda.ReplyHandlerPOA._invoke(ReplyHandlerPOA.java:66) at org.jacorb.poa.RequestProcessor.invokeOperation(RequestProcessor.java:297) at org.jacorb.poa.RequestProcessor.process(RequestProcessor.java:591) at org.jacorb.poa.RequestProcessor.run(RequestProcessor.java:734) Caused by: cern.cmw.IOError: [FesaInternalError - -1] Notification failure. at cern.cmw.rda.client.ServerConnection.insertValue(ServerConnection.java:2768) at cern.cmw.rda.client.ServerConnection.handleReplyPacket(ServerConnection.java:2610) ... 4 more

FESA or CMW problem?

12/12/2013

15

Improvements for Request/Reply Failures

• Provide good error messages, especially on system boundaries– Short information understandable by OP to pinpoint faulty component– Complete information (e.g. stack trace) for offline analysis by experts, with enough context

information (e.g. arguments that caused the failure)• Enforce API contracts between your system and others

– Check incoming arguments/data from callers (users, clients)– Check results obtained from services (Java servers, FESA device, DB, etc)– Fail early with clear error messages (don’t wait until a NullPointerException)– Good examples: Sequencer (tasks), PostMortem Analysis Framework (modules)

• Actions:– Implement above recommendations– Code reviews of important API contracts– Use Unit tests / testbed also to validate error handling and error messages

– Complain (and file JIRA issues) about non-comprehensible error messages and stack traces

12/12/2013

Case 2: Red entries in DiaMon, Alarms in LASER

17

DiaMon – Diagnostics and Monitoring tool for OP

• Tool made for OP to carry out first-line diagnostics– Monitoring of metrics: OS-level, JMX, CMX– Comparison of metrics with configurable thresholds =>

red/yellow/green– Monitoring of processes defined in transfer.ref:

Green if process is running, red if not– Corrective actions: restart processes, reboot computers– Misc Diagnostics: e.g. ping computers, login to console

12/12/2013

1812/12/2013

Right-click => popup to restart process

1912/12/2013

2012/12/2013

21

DiaMon Improvement/Next steps

• More flexibility for thresholds– Make it easier to adjust thresholds (why not from DiaMon GUI?)– Use dynamic thresholds in some cases, based on past values(?)

• Increase capability: from process monitoring to service monitoring– Problem now: A Process runs (is green in DiaMon) but doesn’t provide

expected service (should be red).• Actions to :

– Review threshold management– System responsibles should provide sanity checks (c.f. later)– Framework providers and system providers should expose JMX/CMX

metrics and thresholds

12/12/2013

2212/12/2013

LASER Console as used in TI

2312/12/2013

Right-click on alarms shows details: responsible, link to documentation, etc.

24

LASER Alarm System

• Alarms– Inform OP of problems that need human intervention – Can be raised by systems (e.g. FESA devices, timing, Accelerator Logging, SIS, …)– Can show DiaMon metrics that exceed Error Threshold (if configured to generate alarm)

• Assessment– Very effective mechanism to inform OP about problems – Include responsible(s) to call and link to documentation of error– OP must invest time to look after alarms– Risk of too many alarms (screen full, cannot see all/new alarms)

• Improvements/Actions– LASER 2 will introduce a procedure whereby OP must approve alarms – Effort needed to clean up alarm configurations (responsible + documentation)– Mode-dependent alarms exist already but must be configured– Sanity check and procedure needed for OP to assess that LASER itself works fine

12/12/2013

Case 3: Failures in monitoring-based systems

2612/12/2013 Timing System Overview

27

“Jvideo” for OP to see if timing works

12/12/2013Alarm in LASER if no timing on Front-Ends

28

CERN Accelerator Logging Service - Architecture Overview

7 Days raw data

>20 Years filtered data

PL/SQL filtered data transfer

Data

Pro

vide

rsDat

a Pe

rsist

enceD

ata

Cons

umer

s

Extraction API

~ 250’000 Signals~ 16 data loading processes~ 5.4 billion records per day~ 275 GB per day 90 TB per year throughput

PL/

SQ

L A

PIf

~ 850’000 signals~ 300 data loading processes~ 4 billion records per day~ 140 GB per day50 TB per year stored

>100 custom applications> 3 million extraction requests per day

>500 registered users

12/12/2013

2912/12/2013

30

General Case of Failures in Monitoring-based Systems

• Examples: – FixedDisplay not updated, device not sending updates, PM data missing, data

missing for SIS permit, … – Any GUI not updating!

• Theoretically the diagnostic algorithm is simple:– Start with the system where the fault was detected– Find all other involved systems – Check each and find out which one is failing

• Reality is challenging:– Need to know or find all involved systems – (LSA/InCA server, DB, Logging processes, Proxies, JMS Brokers, devices, FECs,

Timing system, OS resources, …)– Difficult to diagnose if intermediate systems work

12/12/2013

31

New ACET Runtime Dependency Viewer (alpha)

12/12/2013

Still a lot of work…More in a TC

early next year

32

Checking individual systems if they are OK or not

• Idea: Do the same checks as system expert, but automate and require less domain knowledge

• System-specific sanity checks => OK/WARN/ERROR1. Functional test: Execute typical service request and check result2. Checking relevant metrics3. Checking configuration

• Examples of functional sanity checks (from headless clients)– LSA: load test settings, modify (trim) them, save to DB, revert– FESA: trigger test event, execute RT action, notify CMW client– JMS broker: send message through broker back to you, check delay– RBAC: login as test user and check reply (e.g. token + expected roles)– Database: check access and response time– http://abwww: periodically download file and check contents and delay– …

12/12/2013

33

Checking individual systems if they are OK or not

• Examples of relevant metrics– Liveness counters (e.g. LSA Client requests, Accelerator Logging data

flow, CMW updates to clients, FESA RT actions, Timing events, …)– Error indicators: error/exception rate; number of packets lost; too

frequent Java Garbage collection• Checking configuration

– Comparison between CCDB configuration and reality– Version Number of FESA SW, OS, Drivers, Firmware, HW(?), …

– Based on configuration feedback from FECs to CCDB

12/12/2013

34

Checking individual systems – status and plans

• Current situation– Sanity checks generally accepted as good idea, but very few teams have

implemented them– Pioneers in this area: Accelerator Logging team– Ideas and promises from other teams (FESA, CMW liveness counters)– Infrastructure is ready to be used (DiaMon, JMX, CMX, config feedback)

• Actions– Someone has to drive and coordinate this effort (me?)– System teams have to find time to implement sanity checks– Ultimate goal: sanity checks integrated in DiaMon (green=green,

red=red)

12/12/2013

Tools for Advanced Operators or CO Special Shift Workers

• Easy browsing and correlation of all our logfiles• “Taming” the expert diagnostic tools

36

Easy browsing and correlation of Logfiles

• Idea:– If a system has problems, there should be errors in the log files– If not, there shouldn’t be any

• Current situation: our logfiles need to be improved– Expert information, often difficult to understand– Logfiles are often contain spurious “normal” error messages – Logfile (“tracing”) information from Linux Syslog and FESA/CMW servers is centrally collected

on cs-ccr-tracing. Java not yet.– Splunk allows for advanced searching and correlation

• Improvements/Actions– Introduce logging culture (e.g. don’t tolerate spurious “normal errors”)– Reduce log volume by filtering out endlessly repeated messages– Add Java logging to Splunk– Promote Splunk among CO experts (Link from CCM. Training)

12/12/2013

3712/12/2013

Splunk Log Search Window

38

“Taming” the Expert Diagnostic Tools

• Every team has developed good tools and use them– CMW Admin tool (new functionality in RDA3)– Lemon system monitoring tool– Timing diagnostic tools (dtm_diag, mtg_diag, test_tgm, ctr_test, …)– FESA navigator + Expert diag tool– Java Mission Control (JConsole on steroids)– Driver test suites– Worldfip diagnostic tools– …

• Improvement/Actions: – Make a user-friendly subset of the functionality available for 1st line diagnostics– Important to include feedback from CO special shift workers– Training?

12/12/2013

39

Acceptance tests before accepting to support systems

• Insist that systems are ready for 1st line support– 1st line diagnostic tools and documentation must be available– Sanity checks and metrics + thresholds must be configured in DiaMon– Valid responsibles (egroups) must be registered for computers, controls

processes, LASER alarms, etc– Restarting (wreboot/reboot) must be enabled for OP only if not risky– If restarting is available it must improve the situation– Good error messages in GUIs and/or in logfiles– No spurious error messages (e.g. “normal errors”)

• Formal acceptance procedure to ensure the above criteria are met?• Systems that don’t meet criteria won’t get 1st line support (?)

– No diagnostic attempts will be made– Experts will be called directly at any time12/12/2013

40

Summary

• New exploitation model requires better 1st line diagnostic tools

• A lot of good expert diagnostic tools are available• But only few diagnostic tools are suitable for OP

• A common effort from the CO different projects is necessary to contribute

• 1st line support coverage only for systems that “deserve” it12/12/2013

Documents

Strategy and Tools to support the exploitation model OP-Centric View