Strategy and Tools to support the exploitation model
OP-Centric View
Vito Baggiolini
with input from: M. Arruat, J-C. Bau, L. Burdzanowski, M. Buttner, S. Deghaye,
F. Ehm, R. Gorbonosov, G. Kruk, J. Lauener, M. Pace, and others
Strategic Goals for 2014
• Enable/motivate OP to do first line diagnosis and find the right expert to call
• Get CO project teams to contribute to first line diagnostic tools
• Prevent failures => less need for 1st line diagnostics (not covered in this presentation)
12/12/2013
Outline
• Overview of basic diagnostic use cases
• Tools available and possible improvements for each case
• Tools for more advanced diagnosis
• Further aspects and a summary
Different 1st Line Diagnostic Tools for different Cases
• Case 1: Failures in request/reply systems
– Operator clicks on a button, expects a result but gets an error or a timeout
– E.g. InCA WorkingSet, LSA Applications, Sequencer, OASIS, …
• Case 2: Alarms in LASER, red entries in DiaMon
• Case 3: Failures in monitoring-based systems
– E.g. Logging system, PostMortem Analysis, Software Interlock, timing, FixedDisplays, …
– Often background processes without GUI
– Failure not immediately visible to OP
– OP discovers error later in some place not directly related to fault
Case 1: Failures in Request/Reply systems
Case 1: Request Failures seen in GUIs
• Goal: OP should not just call the person responsible for the GUI, but identify the system that caused the error and call the responsible expert
• Some integrated diagnostic tools are available (or promised)
– WorkingSet/Knob diagnostics
– InCA Server diagnostics (promised for April)
– “Why-no-permit” functionality in Software Interlock System
– JAPC Monitoring Diagnostic GUI
– …
InCA Working Set - Diagnostic tools
InCA Working Set - Diagnostic toolbox
InCA Working Set – JAPC/RDA diagnostics
Case 1: Request Failures seen in GUIs
• OP should not just call the person responsible for the GUI
• But identify the system that caused the error and call the corresponding expert
• Integrated diagnostic tools are available (or promised)
– WorkingSet/Knob diagnostics
– InCA Server diagnostics
– “Why” functionality in Software Interlock System
– JAPC Monitoring Diagnostic GUI
– …
• Good error messages can be very useful for OP (cf. e-logbook)
– Bad ones are not even useful for the expert
Examples of Good (because clear) error messages
Exception: [cern.japc.ParameterException] "Server 'BQBBQLHC.cfv-ua47-bqpll' is down or unreachable"
Problem loading the settings for the TCTs: TCTVA.4L2.B1: java.lang.Exception: Rejected, first point of profile different from current position!
asynchronous operation on LHC.OFSU/[email protected] failed
cern.japc.ParameterException: Server overloaded and not able to handle all requests, will loose updates.
One long but good (because complete) error message
LHC OP logbook [Tuesday 04-Sep-2012 Morning]
2012/09/04 13:45:57 DRIVE ABORT GAP CLEANING SETTING [ERROR] DriveException [Failed to drive some parameters AbortGapCleanVB1/CleaningStrength: cern.cmw.AccessDenied: RBAC Exception: Authorization access denied because there is no matching access rule.
Transaction: [SET on AbortGapCleanVB1#CleaningStrength]
Token [serial=0x71eb758d; user=lhcop; authTime=2012.09.04 08:04:22; endTime=2012.09.04 16:03:22; application=LHC Sequencer GUI; location=CCC-LHC; locationAddress=172.18.200.124; roles=[DIAMON-Operator, SeqLhcOperator, SeqHwcOperator, SeqSpsOperator, LHC-Operator, OP-Daemon, SPS-Operator]]
RDA Info [clientHost=cs-ccr-lsa1.cern.ch; remoteUser=copera; serverName=ALLAbortGapClean.cfv-sr4-adtvb1]
RBAC Setup [policy=strict, mode="OPERATIONAL"]
RBAC Error message (cont)
RBAC Access Map excerpt for [SET on AbortGapCleanVB1#CleaningStrength]:

Class             Property          Device            Role           Application  Location             Mode             Operation
ALLAbortGapClean  CleaningStrength  AbortGapCleanVB1  RF-Expert      *            RF-864-Control-Room  *                SET
ALLAbortGapClean  CleaningStrength  AbortGapCleanVB1  RF-Expert      *            RF-SR4-Control-Room  *                SET
ALLAbortGapClean  CleaningStrength  AbortGapCleanVB1  RF-Expert      *            *                    NON-OPERATIONAL  SET
ALLAbortGapClean  CleaningStrength  AbortGapCleanVB1  *              *            RF-FrontEnds         *                SET
ALLAbortGapClean  CleaningStrength  AbortGapCleanVB1  *              *            RF-Servers           *                SET
ALLAbortGapClean  CleaningStrength  AbortGapCleanVB1  RF-LHC-Piquet  *            *                    *                SET
ALLAbortGapClean  CleaningStrength  AbortGapCleanVB1  RF-Expert      *            CCC-LHC              *                SET
Clearly a team that got too many spurious support requests ;-)
Ambiguous Error Message
LHC OP [Tuesday 04-Sep-2012 Afternoon]
20:35:01 - 2012-09-04 20:35:01,171 [JapcExecutor-thread-31] ERROR JapcDispatcherImpl +==> Exception occured for parameter id [HC.BLM.SR5.R_ENERGY_ACQ] name [HC.BLM.SR5.R/SISMonitoring]
asynchronous operation on HC.BLM.SR5.R/SISMonitoring@ failed
cern.japc.ParameterException: Notification failure.
    at cern.japc.ext.cmwrda.DataTranslator.createParameterException(DataTranslator.java:269)
    at cern.japc.ext.cmwrda.RDAReplyHandler.handleError(RDAReplyHandler.java:115)
    at cern.cmw.rda.client.ReplyAdapter.handleException(ReplyAdapter.java:88)
    at cern.cmw.rda.client.ServerConnection.handleReplyPacket(ServerConnection.java:2622)
    at cern.cmw.corba.rda.ReplyHandlerPOA._invoke(ReplyHandlerPOA.java:66)
    at org.jacorb.poa.RequestProcessor.invokeOperation(RequestProcessor.java:297)
    at org.jacorb.poa.RequestProcessor.process(RequestProcessor.java:591)
    at org.jacorb.poa.RequestProcessor.run(RequestProcessor.java:734)
Caused by: cern.cmw.IOError: [FesaInternalError - -1] Notification failure.
    at cern.cmw.rda.client.ServerConnection.insertValue(ServerConnection.java:2768)
    at cern.cmw.rda.client.ServerConnection.handleReplyPacket(ServerConnection.java:2610)
    ... 4 more
FESA or CMW problem?
Improvements for Request/Reply Failures
• Provide good error messages, especially on system boundaries
– Short information understandable by OP to pinpoint faulty component
– Complete information (e.g. stack trace) for offline analysis by experts, with enough context information (e.g. arguments that caused the failure)
• Enforce API contracts between your system and others
– Check incoming arguments/data from callers (users, clients)
– Check results obtained from services (Java servers, FESA device, DB, etc.)
– Fail early with clear error messages (don’t wait until a NullPointerException)
– Good examples: Sequencer (tasks), PostMortem Analysis Framework (modules)
• Actions:
– Implement above recommendations
– Code reviews of important API contracts
– Use unit tests / testbed also to validate error handling and error messages
– Complain (and file JIRA issues) about non-comprehensible error messages and stack traces
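
A minimal sketch of "fail early with a clear message" at an API boundary. It is written in Python for brevity; the names `ContractViolation` and `drive_setting`, and the idea of passing limits as an argument, are assumptions made for illustration and are not part of any CO framework. The error carries a short summary for OP and the offending arguments for the expert.

```python
class ContractViolation(ValueError):
    """Raised when a caller breaks the contract of a (hypothetical) service."""

    def __init__(self, summary, **context):
        # Short, operator-readable summary first; arguments kept for experts.
        self.summary = summary
        self.context = context
        details = ", ".join(f"{k}={v!r}" for k, v in sorted(context.items()))
        super().__init__(f"{summary} [{details}]")


def drive_setting(device, prop, value, limits):
    """Hypothetical entry point: check arguments before doing any work."""
    if not device:
        raise ContractViolation("No device name given", property=prop, value=value)
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ContractViolation(
            "Setting is not a number", device=device, property=prop, value=value
        )
    low, high = limits
    if not low <= value <= high:
        raise ContractViolation(
            f"Setting for {device}/{prop} is outside the allowed range",
            device=device, property=prop, value=value, low=low, high=high,
        )
    return f"{device}/{prop} set to {value}"
```

The same function can be exercised from a unit test to validate the error message itself, which is one of the actions listed above.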
Case 2: Red entries in DiaMon, Alarms in LASER
DiaMon – Diagnostics and Monitoring tool for OP
• Tool made for OP to carry out first-line diagnostics
– Monitoring of metrics: OS-level, JMX, CMX
– Comparison of metrics with configurable thresholds => red/yellow/green
– Monitoring of processes defined in transfer.ref: green if process is running, red if not
– Corrective actions: restart processes, reboot computers
– Misc diagnostics: e.g. ping computers, login to console
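
The threshold comparison can be pictured with the short Python sketch below. It is not DiaMon code; the function names and the convention that higher values are worse are assumptions for illustration.

```python
def metric_colour(value, warn_threshold, error_threshold):
    """Map a metric to red/yellow/green, assuming higher values are worse."""
    if value >= error_threshold:
        return "red"
    if value >= warn_threshold:
        return "yellow"
    return "green"


def process_colour(is_running):
    """Process monitoring: green if the process is running, red if not."""
    return "green" if is_running else "red"
```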
Right-click => popup to restart process
DiaMon Improvement/Next steps
• More flexibility for thresholds
– Make it easier to adjust thresholds (why not from DiaMon GUI?)
– Use dynamic thresholds in some cases, based on past values(?)
• Increase capability: from process monitoring to service monitoring
– Problem now: a process runs (is green in DiaMon) but doesn’t provide expected service (should be red).
• Actions:
– Review threshold management
– System responsibles should provide sanity checks (cf. later)
– Framework providers and system providers should expose JMX/CMX metrics and thresholds
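
One possible reading of "dynamic thresholds based on past values" is sketched below in Python: the threshold is the mean of recent samples plus a multiple of their standard deviation. This particular rule and the factor `k` are assumptions chosen for illustration, not something the slides prescribe.

```python
from statistics import mean, pstdev


def dynamic_threshold(history, k=3.0):
    """Threshold derived from past values: mean plus k standard deviations."""
    return mean(history) + k * pstdev(history)


def exceeds_dynamic_threshold(value, history, k=3.0):
    """True if the new value is unusually high compared with the history."""
    return value > dynamic_threshold(history, k)
```

A rule like this adapts to each machine's normal load, at the cost of needing enough history and a sensible choice of `k`.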
LASER Console as used in TI
Right-click on alarms shows details: responsible, link to documentation, etc.
LASER Alarm System
• Alarms
– Inform OP of problems that need human intervention
– Can be raised by systems (e.g. FESA devices, timing, Accelerator Logging, SIS, …)
– Can show DiaMon metrics that exceed Error Threshold (if configured to generate alarm)
• Assessment
– Very effective mechanism to inform OP about problems
– Include responsible(s) to call and link to documentation of error
– OP must invest time to look after alarms
– Risk of too many alarms (screen full, cannot see all/new alarms)
• Improvements/Actions
– LASER 2 will introduce a procedure whereby OP must approve alarms
– Effort needed to clean up alarm configurations (responsible + documentation)
– Mode-dependent alarms exist already but must be configured
– Sanity check and procedure needed for OP to assess that LASER itself works fine
Case 3: Failures in monitoring-based systems
Timing System Overview
“Jvideo” for OP to see if timing works
Alarm in LASER if no timing on Front-Ends
CERN Accelerator Logging Service - Architecture Overview
[Figure: architecture diagram with three layers: Data Providers, Data Persistence, Data Consumers]
– Interfaces shown: PL/SQL API, Extraction API
– Storage shown: 7 days raw data; >20 years filtered data; PL/SQL filtered data transfer
– ~250’000 signals, ~16 data loading processes, ~5.4 billion records per day, ~275 GB per day, 90 TB per year throughput
– ~850’000 signals, ~300 data loading processes, ~4 billion records per day, ~140 GB per day, 50 TB per year stored
– >100 custom applications, >3 million extraction requests per day, >500 registered users
General Case of Failures in Monitoring-based Systems
• Examples:
– FixedDisplay not updated, device not sending updates, PM data missing, data missing for SIS permit, …
– Any GUI not updating!
• Theoretically the diagnostic algorithm is simple:
– Start with the system where the fault was detected
– Find all other involved systems
– Check each and find out which one is failing
• Reality is challenging:
– Need to know or find all involved systems (LSA/InCA server, DB, Logging processes, Proxies, JMS Brokers, devices, FECs, Timing system, OS resources, …)
– Difficult to diagnose if intermediate systems work
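
The "theoretically simple" algorithm can be sketched as a walk over a dependency map, shown here in Python. The map `DEPENDS_ON` and the `is_healthy` callback are hypothetical; in practice, obtaining exactly this information is the hard part described above.

```python
from collections import deque


def involved_systems(start, depends_on):
    """Return the faulty system and everything it depends on, nearest first."""
    order = [start]
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for dependency in depends_on.get(current, []):
            if dependency not in seen:
                seen.add(dependency)
                order.append(dependency)
                queue.append(dependency)
    return order


def failing_systems(start, depends_on, is_healthy):
    """Check every involved system and keep the ones that fail their check."""
    return [name for name in involved_systems(start, depends_on)
            if not is_healthy(name)]


# Hypothetical dependency map, for illustration only.
DEPENDS_ON = {
    "FixedDisplay": ["Proxy"],
    "Proxy": ["JMS Broker", "Device"],
    "Device": ["FEC"],
    "FEC": ["Timing"],
}
```

Calling `failing_systems("FixedDisplay", DEPENDS_ON, is_healthy)` with a per-system check then narrows the fault down to the systems that actually fail.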
New ACET Runtime Dependency Viewer (alpha)
Still a lot of work… More in a TC early next year
Checking individual systems if they are OK or not
• Idea: Do the same checks as system expert, but automate and require less domain knowledge
• System-specific sanity checks => OK/WARN/ERROR
1. Functional test: execute typical service request and check result
2. Checking relevant metrics
3. Checking configuration
• Examples of functional sanity checks (from headless clients)
– LSA: load test settings, modify (trim) them, save to DB, revert
– FESA: trigger test event, execute RT action, notify CMW client
– JMS broker: send message through broker back to you, check delay
– RBAC: login as test user and check reply (e.g. token + expected roles)
– Database: check access and response time
– http://abwww: periodically download file and check contents and delay
– …
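
A functional sanity check of this kind boils down to "send a typical request, compare the reply, measure the delay". The Python sketch below shows that shape; `sanity_check` and its parameters are assumptions for illustration, and `request` stands for whatever headless client call a system team would provide.

```python
import time


def sanity_check(request, expected, warn_after_s, clock=time.monotonic):
    """Run one functional test and return (status, message).

    status is "OK", "WARN" (correct but slow) or "ERROR" (failed or wrong).
    """
    start = clock()
    try:
        result = request()
    except Exception as exc:  # any failure of the service counts as ERROR
        return "ERROR", f"request failed: {exc}"
    elapsed = clock() - start
    if result != expected:
        return "ERROR", f"unexpected reply: {result!r}"
    if elapsed > warn_after_s:
        return "WARN", f"slow reply: {elapsed:.3f} s"
    return "OK", f"reply in {elapsed:.3f} s"
```

The `clock` parameter exists only so that the check itself can be unit-tested with a fake clock.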
Checking individual systems if they are OK or not
• Examples of relevant metrics
– Liveness counters (e.g. LSA Client requests, Accelerator Logging data flow, CMW updates to clients, FESA RT actions, Timing events, …)
– Error indicators: error/exception rate; number of packets lost; too frequent Java garbage collection
• Checking configuration
– Comparison between CCDB configuration and reality
– Version number of FESA SW, OS, drivers, firmware, HW(?), …
– Based on configuration feedback from FECs to CCDB
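
Both ideas reduce to small comparisons, sketched below in Python. The function names and the dictionary layout are assumptions; real data would come from the liveness counters and from the configuration feedback mentioned above.

```python
def is_alive(previous_count, current_count):
    """A liveness counter must have advanced between two samples."""
    return current_count > previous_count


def config_mismatches(expected, reported):
    """Compare the expected configuration with what a front-end reports."""
    problems = []
    for key in sorted(expected):
        if key not in reported:
            problems.append(f"{key}: expected {expected[key]!r}, nothing reported")
        elif reported[key] != expected[key]:
            problems.append(
                f"{key}: expected {expected[key]!r}, found {reported[key]!r}"
            )
    return problems
```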
Checking individual systems – status and plans
• Current situation
– Sanity checks generally accepted as good idea, but very few teams have implemented them
– Pioneers in this area: Accelerator Logging team
– Ideas and promises from other teams (FESA, CMW liveness counters)
– Infrastructure is ready to be used (DiaMon, JMX, CMX, config feedback)
• Actions
– Someone has to drive and coordinate this effort (me?)
– System teams have to find time to implement sanity checks
– Ultimate goal: sanity checks integrated in DiaMon (green=green, red=red)
Tools for Advanced Operators or CO Special Shift Workers
• Easy browsing and correlation of all our logfiles
• “Taming” the expert diagnostic tools
Easy browsing and correlation of Logfiles
• Idea:
– If a system has problems, there should be errors in the log files
– If not, there shouldn’t be any
• Current situation: our logfiles need to be improved
– Expert information, often difficult to understand
– Logfiles often contain spurious “normal” error messages
– Logfile (“tracing”) information from Linux Syslog and FESA/CMW servers is centrally collected on cs-ccr-tracing. Java not yet.
– Splunk allows for advanced searching and correlation
• Improvements/Actions
– Introduce logging culture (e.g. don’t tolerate spurious “normal errors”)
– Reduce log volume by filtering out endlessly repeated messages
– Add Java logging to Splunk
– Promote Splunk among CO experts (link from CCM, training)
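
Filtering out endlessly repeated messages can be as simple as collapsing consecutive duplicates, as in the Python sketch below. The function is hypothetical and assumes the lines are compared without their timestamps; real log lines would first need the timestamp stripped.

```python
def collapse_repeats(lines):
    """Replace runs of identical consecutive messages with a single summary."""
    collapsed = []
    previous, repeats = None, 0
    for line in lines:
        if line == previous:
            repeats += 1
            continue
        if repeats:
            collapsed.append(f"(previous message repeated {repeats} more time(s))")
        collapsed.append(line)
        previous, repeats = line, 0
    if repeats:
        collapsed.append(f"(previous message repeated {repeats} more time(s))")
    return collapsed
```

The first occurrence is always kept, so no distinct message is lost; only the volume goes down.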
Splunk Log Search Window
“Taming” the Expert Diagnostic Tools
• Every team has developed good tools and uses them
– CMW Admin tool (new functionality in RDA3)
– Lemon system monitoring tool
– Timing diagnostic tools (dtm_diag, mtg_diag, test_tgm, ctr_test, …)
– FESA navigator + Expert diag tool
– Java Mission Control (JConsole on steroids)
– Driver test suites
– Worldfip diagnostic tools
– …
• Improvement/Actions:
– Make a user-friendly subset of the functionality available for 1st line diagnostics
– Important to include feedback from CO special shift workers
– Training?
Acceptance tests before accepting to support systems
• Insist that systems are ready for 1st line support
– 1st line diagnostic tools and documentation must be available
– Sanity checks and metrics + thresholds must be configured in DiaMon
– Valid responsibles (egroups) must be registered for computers, controls processes, LASER alarms, etc.
– Restarting (wreboot/reboot) must be enabled for OP only if not risky
– If restarting is available it must improve the situation
– Good error messages in GUIs and/or in logfiles
– No spurious error messages (e.g. “normal errors”)
• Formal acceptance procedure to ensure the above criteria are met?
• Systems that don’t meet criteria won’t get 1st line support (?)
– No diagnostic attempts will be made
– Experts will be called directly at any time
Summary
• New exploitation model requires better 1st line diagnostic tools
• A lot of good expert diagnostic tools are available
• But only few diagnostic tools are suitable for OP
• A common effort from the different CO projects is necessary to contribute such tools
• 1st line support coverage only for systems that “deserve” it