Strategy and Tools to support the exploitation model: OP-Centric View

Vito Baggiolini, with input from: M. Arruat, J-C. Bau, L. Burdzanowski, M. Buttner, S. Deghaye, F. Ehm, R. Gorbonosov, G. Kruk, J. Lauener, M. Pace, and others

Strategic Goals for 2014

• Enable/motivate OP to do first line diagnosis and find the right expert to call

• Get CO project teams to contribute to first line diagnostic tools

• Prevent failures => less need for 1st line diagnostics (not covered in this presentation)

12/12/2013

Outline

• Overview of basic diagnostic use cases

• Tools available and possible improvements for each case

• Tools for more advanced diagnosis

• Further aspects and a summary

Different 1st Line Diagnostic Tools for different Cases

• Case 1: Failures in request/reply systems
  – Operator clicks on a button, expects a result but gets an error or a timeout
  – E.g. InCA WorkingSet, LSA Applications, Sequencer, OASIS, …
• Case 2: Alarms in LASER, red entries in DiaMon
• Case 3: Failures in monitoring-based systems
  – E.g. Logging system, PostMortem Analysis, Software Interlock, timing, FixedDisplays, …
  – Often background processes without GUI
  – Failure not immediately visible to OP
  – OP discovers the error later, in some place not directly related to the fault

Case 1: Failures in Request/Reply systems

Case 1: Request Failures seen in GUIs

• Goal: OP should not just call the person responsible for the GUI, but identify the system that caused the error and call the responsible expert
• Some integrated diagnostic tools are available (or promised)
  – WorkingSet/Knob diagnostics
  – InCA Server diagnostics (promised for April)
  – “Why-no-permit” functionality in Software Interlock System
  – JAPC Monitoring Diagnostic GUI
  – …

InCA Working Set - Diagnostic tools

InCA Working Set - Diagnostic toolbox

InCA Working Set – JAPC/RDA diagnostics

Case 1: Request Failures seen in GUIs

• OP should not just call the person responsible for the GUI
• But identify the system that caused the error and call the corresponding expert
• Integrated diagnostic tools are available (or promised)
  – WorkingSet/Knob diagnostics
  – InCA Server diagnostics
  – “Why” functionality in Software Interlock System
  – JAPC Monitoring Diagnostic GUI
  – …
• Good error messages can be very useful for OP (cf. e-logbook)
  – Bad ones are not even useful for the expert

Examples of Good (because clear) error messages

Exception: [cern.japc.ParameterException] "Server 'BQBBQLHC.cfv-ua47-bqpll' is down or unreachable"

Problem loading the settings for the TCTs: TCTVA.4L2.B1: java.lang.Exception: Rejected, first point of profile different from current position!

asynchronous operation on LHC.OFSU/[email protected] failed
cern.japc.ParameterException: Server overloaded and not able to handle all requests, will loose updates.

One long but good (because complete) error message

LHC OP logbook [Tuesday 04-Sep-2012 Morning]
2012/09/04 13:45:57 DRIVE ABORT GAP CLEANING SETTING [ERROR]
DriveException [Failed to drive some parameters AbortGapCleanVB1/CleaningStrength: cern.cmw.AccessDenied: RBAC Exception: Authorization access denied because there is no matching access rule.
Transaction: [SET on AbortGapCleanVB1#CleaningStrength]
Token [serial=0x71eb758d; user=lhcop; authTime=2012.09.04 08:04:22; endTime=2012.09.04 16:03:22; application=LHC Sequencer GUI;location=CCC-LHC; locationAddress=172.18.200.124; roles=[DIAMON-Operator, SeqLhcOperator, SeqHwcOperator, SeqSpsOperator, LHC-Operator, OP-Daemon, SPS-Operator]]
RDA Info [clientHost=cs-ccr-lsa1.cern.ch; remoteUser=copera; serverName=ALLAbortGapClean.cfv-sr4-adtvb1]
RBAC Setup [policy=strict, mode="OPERATIONAL"]

RBAC Error message (cont)

RBAC Access Map excerpt for [SET on AbortGapCleanVB1#CleaningStrength]:

Class             Property          Device            Role           Application  Location             Mode             Operation
ALLAbortGapClean  CleaningStrength  AbortGapCleanVB1  RF-Expert      *            RF-864-Control-Room  *                SET
ALLAbortGapClean  CleaningStrength  AbortGapCleanVB1  RF-Expert      *            RF-SR4-Control-Room  *                SET
ALLAbortGapClean  CleaningStrength  AbortGapCleanVB1  RF-Expert      *            *                    NON-OPERATIONAL  SET
ALLAbortGapClean  CleaningStrength  AbortGapCleanVB1  *              *            RF-FrontEnds         *                SET
ALLAbortGapClean  CleaningStrength  AbortGapCleanVB1  *              *            RF-Servers           *                SET
ALLAbortGapClean  CleaningStrength  AbortGapCleanVB1  RF-LHC-Piquet  *            *                    *                SET
ALLAbortGapClean  CleaningStrength  AbortGapCleanVB1  RF-Expert      *            CCC-LHC              *                SET

Clearly a team that got too many spurious support requests ;-)
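
Why this message is complete enough for OP to act on can be illustrated with a small sketch that replays the token against the access map excerpt. The matching semantics used here (a "*" matches anything, and a rule applies if the token holds the rule's role) are an assumption made for illustration and not the actual RBAC implementation; `AccessRule` and `matching_rules` are hypothetical names.

```python
from dataclasses import dataclass

WILDCARD = "*"


@dataclass(frozen=True)
class AccessRule:
    """Role / Application / Location / Mode columns of one access map row.

    Class, property, device and operation are identical in every row of
    the excerpt, so they are left out of this sketch.
    """
    role: str
    application: str
    location: str
    mode: str


# Rows copied from the access map excerpt in the error message.
ACCESS_MAP = [
    AccessRule("RF-Expert", "*", "RF-864-Control-Room", "*"),
    AccessRule("RF-Expert", "*", "RF-SR4-Control-Room", "*"),
    AccessRule("RF-Expert", "*", "*", "NON-OPERATIONAL"),
    AccessRule("*", "*", "RF-FrontEnds", "*"),
    AccessRule("*", "*", "RF-Servers", "*"),
    AccessRule("RF-LHC-Piquet", "*", "*", "*"),
    AccessRule("RF-Expert", "*", "CCC-LHC", "*"),
]

# Roles listed in the token of the error message.
TOKEN_ROLES = [
    "DIAMON-Operator", "SeqLhcOperator", "SeqHwcOperator", "SeqSpsOperator",
    "LHC-Operator", "OP-Daemon", "SPS-Operator",
]


def field_matches(rule_value, actual):
    """Assumed semantics: a wildcard matches any value."""
    return rule_value == WILDCARD or rule_value == actual


def matching_rules(rules, roles, application, location, mode):
    """Return the rules that would grant access under the assumed semantics."""
    held = set(roles)
    return [
        rule for rule in rules
        if (rule.role == WILDCARD or rule.role in held)
        and field_matches(rule.application, application)
        and field_matches(rule.location, location)
        and field_matches(rule.mode, mode)
    ]


if __name__ == "__main__":
    granted = matching_rules(
        ACCESS_MAP, TOKEN_ROLES, "LHC Sequencer GUI", "CCC-LHC", "OPERATIONAL"
    )
    print("matching rules:", granted)
```

Under these assumptions no rule matches the operator's token, which is what the message reports. The only rule for location CCC-LHC requires the role RF-Expert, so the reader can see directly which role is missing.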

Ambiguous Error Message

LHC OP [Tuesday 04-Sep-2012 Afternoon]
20:35:01 - 2012-09-04 20:35:01,171 [JapcExecutor-thread-31] ERROR JapcDispatcherImpl +==> Exception occured for parameter id [HC.BLM.SR5.R_ENERGY_ACQ] name [HC.BLM.SR5.R/SISMonitoring]
asynchronous operation on HC.BLM.SR5.R/SISMonitoring@ failed
cern.japc.ParameterException: Notification failure.
    at cern.japc.ext.cmwrda.DataTranslator.createParameterException(DataTranslator.java:269)
    at cern.japc.ext.cmwrda.RDAReplyHandler.handleError(RDAReplyHandler.java:115)
    at cern.cmw.rda.client.ReplyAdapter.handleException(ReplyAdapter.java:88)
    at cern.cmw.rda.client.ServerConnection.handleReplyPacket(ServerConnection.java:2622)
    at cern.cmw.corba.rda.ReplyHandlerPOA._invoke(ReplyHandlerPOA.java:66)
    at org.jacorb.poa.RequestProcessor.invokeOperation(RequestProcessor.java:297)
    at org.jacorb.poa.RequestProcessor.process(RequestProcessor.java:591)
    at org.jacorb.poa.RequestProcessor.run(RequestProcessor.java:734)
Caused by: cern.cmw.IOError: [FesaInternalError - -1] Notification failure.
    at cern.cmw.rda.client.ServerConnection.insertValue(ServerConnection.java:2768)
    at cern.cmw.rda.client.ServerConnection.handleReplyPacket(ServerConnection.java:2610)
    ... 4 more

FESA or CMW problem?

Improvements for Request/Reply Failures

• Provide good error messages, especially on system boundaries
  – Short information understandable by OP to pinpoint the faulty component
  – Complete information (e.g. stack trace) for offline analysis by experts, with enough context information (e.g. arguments that caused the failure)
• Enforce API contracts between your system and others
  – Check incoming arguments/data from callers (users, clients)
  – Check results obtained from services (Java servers, FESA devices, DB, etc.)
  – Fail early with clear error messages (don’t wait until a NullPointerException)
  – Good examples: Sequencer (tasks), PostMortem Analysis Framework (modules)
• Actions:
  – Implement the above recommendations
  – Code reviews of important API contracts
  – Use unit tests / testbed also to validate error handling and error messages
  – Complain (and file JIRA issues) about incomprehensible error messages and stack traces
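
A minimal sketch of "fail early with clear error messages" at a system boundary, written in Python for brevity. The function `drive_setting`, the exception `ContractViolation`, the parameter naming and the limits table are all hypothetical and only illustrate the recommendation; they do not correspond to any existing controls API.

```python
class ContractViolation(ValueError):
    """Raised at a system boundary when a caller breaks the API contract."""


def drive_setting(device, prop, value, limits):
    """Validate a SET request before doing any work (hypothetical API).

    `limits` maps "device#property" to an allowed (low, high) range.
    Each check fails immediately and names the offending argument, so the
    caller sees a short message instead of a late, unrelated exception.
    """
    if not device:
        raise ContractViolation("drive_setting: device name is empty")
    if not prop:
        raise ContractViolation(f"drive_setting: property name is empty for device '{device}'")
    key = f"{device}#{prop}"
    if key not in limits:
        raise ContractViolation(
            f"drive_setting: unknown parameter '{key}' (known: {sorted(limits)})"
        )
    low, high = limits[key]
    if not (low <= value <= high):
        raise ContractViolation(
            f"drive_setting: value {value} for '{key}' is outside the allowed range [{low}, {high}]"
        )
    # The real work would start here; this sketch only echoes the request.
    return f"SET {key} = {value}"


if __name__ == "__main__":
    limits = {"DEV1#Strength": (0.0, 1.0)}  # hypothetical parameter and range
    print(drive_setting("DEV1", "Strength", 0.5, limits))
    try:
        drive_setting("DEV1", "Strength", 2.0, limits)
    except ContractViolation as error:
        print("rejected:", error)
```

The same checks double as unit test targets: a test can assert that a bad argument is rejected and that the message names the parameter and the offending value.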

Case 2: Red entries in DiaMon, Alarms in LASER

DiaMon – Diagnostics and Monitoring tool for OP

• Tool made for OP to carry out first-line diagnostics
  – Monitoring of metrics: OS-level, JMX, CMX
  – Comparison of metrics with configurable thresholds => red/yellow/green
  – Monitoring of processes defined in transfer.ref: green if process is running, red if not
  – Corrective actions: restart processes, reboot computers
  – Misc diagnostics: e.g. ping computers, login to console
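
The threshold comparison can be sketched as follows. This is an illustration only, assuming that higher metric values are worse and that each metric has one warning and one error threshold; the function names and the example values are hypothetical and do not describe DiaMon's internals.

```python
def metric_colour(value, warn, error):
    """Map a metric value to red/yellow/green.

    Assumption: higher values are worse, and warn <= error.
    """
    if warn > error:
        raise ValueError(f"warn threshold {warn} is above error threshold {error}")
    if value >= error:
        return "red"
    if value >= warn:
        return "yellow"
    return "green"


def process_colour(is_running):
    """Process monitoring: green if the process is running, red if not."""
    return "green" if is_running else "red"


if __name__ == "__main__":
    # Hypothetical CPU load thresholds in percent.
    for load in (35.0, 85.0, 97.0):
        print(load, metric_colour(load, warn=80.0, error=95.0))
    print("process:", process_colour(False))
```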

Right-click => popup to restart process

DiaMon Improvement/Next steps

• More flexibility for thresholds
  – Make it easier to adjust thresholds (why not from DiaMon GUI?)
  – Use dynamic thresholds in some cases, based on past values (?)
• Increase capability: from process monitoring to service monitoring
  – Problem now: a process runs (is green in DiaMon) but doesn’t provide the expected service (should be red)
• Actions:
  – Review threshold management
  – Those responsible for systems should provide sanity checks (cf. later)
  – Framework providers and system providers should expose JMX/CMX metrics and thresholds
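
One possible reading of "dynamic thresholds based on past values" is sketched here: the threshold is the mean of recent samples plus a multiple of their standard deviation. The slides do not specify a method, so the formula, the factor `k` and the sample values are assumptions for illustration.

```python
from statistics import mean, pstdev


def dynamic_threshold(past_values, k=3.0):
    """Threshold derived from history: mean + k * standard deviation.

    Assumption: the recent past represents normal behaviour.
    """
    if len(past_values) < 2:
        raise ValueError("need at least two past values to derive a threshold")
    return mean(past_values) + k * pstdev(past_values)


def exceeds(value, past_values, k=3.0):
    """True if the new value is above the dynamically derived threshold."""
    return value > dynamic_threshold(past_values, k)


if __name__ == "__main__":
    history = [10, 12, 11, 9, 10, 12, 11, 9]  # hypothetical metric samples
    print("threshold:", round(dynamic_threshold(history), 3))
    print("12 exceeds:", exceeds(12, history))
    print("20 exceeds:", exceeds(20, history))
```

A fixed threshold remains simpler to explain to OP; a history-based one mainly helps for metrics whose normal level differs from machine to machine.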

LASER Console as used in TI

Right-click on alarms shows details: responsible, link to documentation, etc.

LASER Alarm System

• Alarms
  – Inform OP of problems that need human intervention
  – Can be raised by systems (e.g. FESA devices, timing, Accelerator Logging, SIS, …)
  – Can show DiaMon metrics that exceed Error Threshold (if configured to generate alarm)
• Assessment
  – Very effective mechanism to inform OP about problems
  – Include responsible(s) to call and link to documentation of error
  – OP must invest time to look after alarms
  – Risk of too many alarms (screen full, cannot see all/new alarms)
• Improvements/Actions
  – LASER 2 will introduce a procedure whereby OP must approve alarms
  – Effort needed to clean up alarm configurations (responsible + documentation)
  – Mode-dependent alarms exist already but must be configured
  – Sanity check and procedure needed for OP to assess that LASER itself works fine

Case 3: Failures in monitoring-based systems

Timing System Overview

“Jvideo” for OP to see if timing works

Alarm in LASER if no timing on Front-Ends

CERN Accelerator Logging Service - Architecture Overview

[Figure: architecture diagram with Data Providers, Data Persistence and Data Consumers layers. Labels recovered from the diagram:]

• Data persistence: 7 days raw data; >20 years filtered data; PL/SQL filtered data transfer
• Interfaces: PL/SQL API; Extraction API
• Throughput: ~250’000 signals, ~16 data loading processes, ~5.4 billion records per day, ~275 GB per day, 90 TB per year
• Stored: ~850’000 signals, ~300 data loading processes, ~4 billion records per day, ~140 GB per day, 50 TB per year
• Consumers: >100 custom applications, >3 million extraction requests per day, >500 registered users

General Case of Failures in Monitoring-based Systems

• Examples:
  – FixedDisplay not updated, device not sending updates, PM data missing, data missing for SIS permit, …
  – Any GUI not updating!
• Theoretically the diagnostic algorithm is simple:
  – Start with the system where the fault was detected
  – Find all other involved systems
  – Check each and find out which one is failing
• Reality is challenging:
  – Need to know or find all involved systems (LSA/InCA server, DB, Logging processes, Proxies, JMS Brokers, devices, FECs, Timing system, OS resources, …)
  – Difficult to diagnose whether intermediate systems work
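
The "theoretically simple" algorithm can be written down as a short sketch: walk the dependency graph from the system where the fault was seen, then check every system found. The dependency edges and the health check below are invented for illustration; in practice, obtaining an accurate dependency graph and meaningful per-system checks is exactly the hard part.

```python
from collections import deque


def involved_systems(start, depends_on):
    """Breadth-first walk: `start` plus everything it depends on, transitively."""
    ordered = [start]
    visited = {start}
    queue = deque([start])
    while queue:
        system = queue.popleft()
        for dependency in depends_on.get(system, []):
            if dependency not in visited:
                visited.add(dependency)
                ordered.append(dependency)
                queue.append(dependency)
    return ordered


def failing_systems(start, depends_on, is_healthy):
    """Check each involved system and return those that fail the check."""
    return [s for s in involved_systems(start, depends_on) if not is_healthy(s)]


if __name__ == "__main__":
    # Hypothetical dependency edges between systems named on the slide.
    depends_on = {
        "FixedDisplay": ["JMS Broker", "Proxy"],
        "Proxy": ["Device"],
        "Device": ["FEC"],
        "FEC": ["Timing system"],
    }
    broken = {"Proxy"}  # stand-in for real sanity checks
    print(involved_systems("FixedDisplay", depends_on))
    print(failing_systems("FixedDisplay", depends_on, lambda s: s not in broken))
```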

New ACET Runtime Dependency Viewer (alpha)

Still a lot of work… More in a TC early next year

Checking individual systems if they are OK or not

• Idea: Do the same checks as system expert, but automate and require less domain knowledge

• System-specific sanity checks => OK/WARN/ERROR
  1. Functional test: execute typical service request and check result
  2. Checking relevant metrics
  3. Checking configuration
• Examples of functional sanity checks (from headless clients)
  – LSA: load test settings, modify (trim) them, save to DB, revert
  – FESA: trigger test event, execute RT action, notify CMW client
  – JMS broker: send message through broker back to you, check delay
  – RBAC: login as test user and check reply (e.g. token + expected roles)
  – Database: check access and response time
  – http://abwww: periodically download file and check contents and delay
  – …
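
A functional sanity check of this kind could be sketched as follows: execute a typical request, compare the reply with the expected one, and measure the delay. The function `run_sanity_check`, its parameters and the stand-in request are hypothetical; a real check would call the service under test (broker round trip, test login, file download) in place of the lambda.

```python
import time


def run_sanity_check(request, expected, warn_after_s, clock=time.monotonic):
    """Run one functional test and classify the outcome as OK/WARN/ERROR.

    `request` is any zero-argument callable performing a typical service
    request. `clock` is injectable so the check can be tested without waiting.
    """
    start = clock()
    try:
        reply = request()
    except Exception as exc:  # any failure of the request is an ERROR
        return "ERROR", f"request failed: {exc!r}"
    elapsed = clock() - start
    if reply != expected:
        return "ERROR", f"unexpected reply {reply!r}, expected {expected!r}"
    if elapsed > warn_after_s:
        return "WARN", f"reply correct but slow: {elapsed:.3f}s > {warn_after_s}s"
    return "OK", f"reply correct in {elapsed:.3f}s"


if __name__ == "__main__":
    # Stand-in for e.g. sending a message through a broker back to yourself.
    status, detail = run_sanity_check(lambda: "pong", expected="pong", warn_after_s=1.0)
    print(status, detail)
```

Returning a short detail string next to the status keeps the result understandable for OP while still giving the expert something to start from.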

Checking individual systems if they are OK or not

• Examples of relevant metrics
  – Liveness counters (e.g. LSA Client requests, Accelerator Logging data flow, CMW updates to clients, FESA RT actions, Timing events, …)
  – Error indicators: error/exception rate; number of packets lost; too frequent Java garbage collection
• Checking configuration
  – Comparison between CCDB configuration and reality
  – Version number of FESA SW, OS, drivers, firmware, HW (?), …
  – Based on configuration feedback from FECs to CCDB
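
Both checks reduce to small comparisons, sketched here under simple assumptions: a liveness counter is considered alive if it increased between two samples, and a configuration check lists every key whose expected value (as it would come from CCDB) differs from what was reported back. The function names, keys and values are hypothetical.

```python
def counter_is_live(previous, current):
    """A liveness counter is alive if it increased since the last sample."""
    return current > previous


def config_mismatches(expected, reported):
    """Return {key: (expected value, reported value)} for every difference.

    Keys present on only one side are reported with None on the other.
    """
    keys = sorted(set(expected) | set(reported))
    return {
        key: (expected.get(key), reported.get(key))
        for key in keys
        if expected.get(key) != reported.get(key)
    }


if __name__ == "__main__":
    print(counter_is_live(previous=100, current=105))
    print(counter_is_live(previous=105, current=105))
    # Hypothetical version labels, not real software versions.
    expected = {"fesa_sw": "A", "os": "B"}
    reported = {"fesa_sw": "A-old", "os": "B", "driver": "C"}
    print(config_mismatches(expected, reported))
```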

Checking individual systems – status and plans

• Current situation
  – Sanity checks generally accepted as a good idea, but very few teams have implemented them
  – Pioneers in this area: Accelerator Logging team
  – Ideas and promises from other teams (FESA, CMW liveness counters)
  – Infrastructure is ready to be used (DiaMon, JMX, CMX, config feedback)
• Actions
  – Someone has to drive and coordinate this effort (me?)
  – System teams have to find time to implement sanity checks
  – Ultimate goal: sanity checks integrated in DiaMon (green=green, red=red)

Tools for Advanced Operators or CO Special Shift Workers

• Easy browsing and correlation of all our logfiles
• “Taming” the expert diagnostic tools

Easy browsing and correlation of Logfiles

• Idea:
  – If a system has problems, there should be errors in the log files
  – If not, there shouldn’t be any
• Current situation: our logfiles need to be improved
  – Expert information, often difficult to understand
  – Logfiles often contain spurious “normal” error messages
  – Logfile (“tracing”) information from Linux Syslog and FESA/CMW servers is centrally collected on cs-ccr-tracing; Java logs are not yet collected
  – Splunk allows for advanced searching and correlation
• Improvements/Actions
  – Introduce logging culture (e.g. don’t tolerate spurious “normal errors”)
  – Reduce log volume by filtering out endlessly repeated messages
  – Add Java logging to Splunk
  – Promote Splunk among CO experts (link from CCM, training)
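
Filtering out endlessly repeated messages can be as simple as collapsing runs of identical consecutive lines. The sketch below assumes that timestamps have already been stripped, so that repeated messages compare equal; the function name and the sample lines are hypothetical.

```python
from itertools import groupby


def collapse_repeats(lines):
    """Replace each run of identical consecutive messages by one line with a count."""
    collapsed = []
    for message, run in groupby(lines):
        count = sum(1 for _ in run)
        if count == 1:
            collapsed.append(message)
        else:
            collapsed.append(f"{message} [repeated {count} times]")
    return collapsed


if __name__ == "__main__":
    log = [
        "ERROR connection lost",
        "ERROR connection lost",
        "ERROR connection lost",
        "INFO reconnected",
        "ERROR connection lost",
    ]
    for line in collapse_repeats(log):
        print(line)
```

Only consecutive repeats are merged, so the order of events is preserved and a message that comes back later is still visible as a new occurrence.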

Splunk Log Search Window

“Taming” the Expert Diagnostic Tools

• Every team has developed good tools and uses them
  – CMW Admin tool (new functionality in RDA3)
  – Lemon system monitoring tool
  – Timing diagnostic tools (dtm_diag, mtg_diag, test_tgm, ctr_test, …)
  – FESA navigator + Expert diag tool
  – Java Mission Control (JConsole on steroids)
  – Driver test suites
  – WorldFIP diagnostic tools
  – …
• Improvement/Actions:
  – Make a user-friendly subset of the functionality available for 1st line diagnostics
  – Important to include feedback from CO special shift workers
  – Training?

Acceptance tests before accepting to support systems

• Insist that systems are ready for 1st line support
  – 1st line diagnostic tools and documentation must be available
  – Sanity checks and metrics + thresholds must be configured in DiaMon
  – Valid responsible persons (egroups) must be registered for computers, controls processes, LASER alarms, etc.
  – Restarting (wreboot/reboot) must be enabled for OP only if not risky
  – If restarting is available it must improve the situation
  – Good error messages in GUIs and/or in logfiles
  – No spurious error messages (e.g. “normal errors”)
• Formal acceptance procedure to ensure the above criteria are met?
• Systems that don’t meet criteria won’t get 1st line support (?)
  – No diagnostic attempts will be made
  – Experts will be called directly at any time

Summary

• New exploitation model requires better 1st line diagnostic tools
• A lot of good expert diagnostic tools are available
• But only a few diagnostic tools are suitable for OP
• A common effort from the different CO projects is necessary to contribute such tools
• 1st line support coverage only for systems that “deserve” it