IceCube Live: Status and Overview
John Jacobsen, [email protected]
IceCube Live review, August 12, 2010
Review Materials: http://bit.ly/live-review
I3Live Website: http://live.icecube.wisc.edu
Tuesday, August 10, 2010


Page 1

IceCube Live: Status and Overview

John Jacobsen, [email protected]

IceCube Live review, August 12, 2010

Review Materials: http://bit.ly/live-review
I3Live Website: http://live.icecube.wisc.edu

Page 2

IceCube Live

Background

Page 3

IceCube Live

[Timeline figure: "An abbreviated history of Experiment Control", IC9 through IC79, 2005-2010. In sequence: "Portlets" (LBNL: Patton, Stouffer, Day), "Roadrunner" (LBNL: Jacobsen), "Anvil" (LBNL: Patton, Day), and finally IceCube Live (Jacobsen et al.). Subsystems are marked as operational or as integrated with Experiment Control; they include TestDAQ, rDAQ, pDAQ, SPADE, PnF (Plans A and B), SN-DAQ, I3Moni, Verification, ICL temperatures, meteorology, ITS and meteor radar.]

3

Page 4

IceCube Live

Challenges

• Remote site: design for low bandwidth
• Distributed collaboration: use the Web for maximum accessibility
• Low (hu)manpower and shifting requirements:
  - leverage existing technologies/services
  - use Python
  - deliver code early and improve incrementally (Agile)
• Many, diverse subsystems:
  - provide plug-in architecture
  - help people integrate their own systems

4

Page 5

IceCube Live

Requirements

Page 6

IceCube Live

Requirements? What requirements?

No formal requirements document was ever written... but the overall "vision" was sketched out early on:

6

https://docushare.icecube.wisc.edu/dsweb/Get/Document-45721/

Page 7

IceCube Live

OK. There are some basic, high-level requirements:

• Show status of the detector, at Pole and in the NH:
  - current status, in near real time
  - historical status
• Allow operation of DAQ, including flashers
• Allow operation of other subsystems
• Alert operators when problems occur
• Maximize uptime
• Provide public interface for outreach (proposed early on, now actively sought)
• Do all of the above securely

7

Page 8

IceCube Live

1. Display Detector Status and History
1.a. Show current status in LiveView ("Status" page)
1.a.i. Show current run status, including DAQ state, run number, time elapsed in run, run configuration, DAQ release, event count, and number of active DOMs
1.a.ii. Show current "light mode" (dark / LID)
1.b. Provide command line tools for checking current status
1.b.i. "livecmd check" shows current run, subrun, run configuration, run start time, number of DAQ events, DAQ release, current light mode, and status of stand-alone components
1.c. Show detector history in LiveView
1.c.i. Graph quantities of interest on "Status" page, over the past 24 hours: event rate, number of active DOMs, outside ambient temperature, CPU load on main experiment control machine, and deviation of current rate
1.c.ii. Provide "Recent" page showing detector run history
1.c.ii.1. Present overview of recent runs in tabular format, showing run number, start time, duration, trigger rate, "zombie" DOMs, run configuration, pDAQ release, and light mode (LID/dark)
1.c.ii.2. Allow users to specify start and end times of run list (default is one week), back to the beginning of 2007 (IC22)
1.c.ii.3. Show detector live time for selected runs
1.c.ii.4. Show missing runs and dead time (e.g., 27 minutes), highlighted as a colored gap
1.c.ii.5. Allow operators to add and delete comments on individual runs. Run comments are synchronized immediately between hemispheres via ITS. Only operators can delete comments. The "icecube" user can view but cannot add comments.
1.c.ii.6. Interface with DAQ logs transmitted to the Data Warehouse to show detailed DAQ log information for all available runs; link to DAQ monitoring/logging files (with file sizes), Verification plots (with abbreviated summary shown), and I3Moni analysis pages
1.c.iii. Show global event rate history ("Rates" page)
1.c.iii.1. Show three different time scales (last two hours, last 24 hours, last 60 days)
1.c.iv. Show detector uptime ("Uptime" page)
1.c.iv.1. Provide year-by-year graphs of monthly uptimes going back to 2007 (IC22)
1.c.v. Provide alert history
1.c.v.1. Show "ticker-like" graphical history snapshot over 24 hours on main (Status) page
1.c.v.2. Show detailed alert history on separate "Alerts" page
1.c.v.3. Show complete triggering history of each individual alert on its own page, as well as who is to be notified by email when the alert is triggered and whether the alert sends pages to the Winter-Overs
1.c.vi. Provide paginated, low-level view of individual monitored quantities
1.c.vi.1. Implement search by service, variable name, variable type, priority; allow wildcards for service and variable name
1.c.vii. Provide paginated view of log messages sent by LiveControl as well as pDAQ, SNDAQ and other subsystems, color-coded by subsystem
1.d. Show current UTC time on all pages
2. Control Data Acquisition and Stand-alone Systems
2.a. Enable control of detector via LiveView or "livecmd"
2.a.i. Start/stop runs, specifying duration, number of runs, run configuration, filtering options, and one or more optional run comments
2.a.ii. Multiple DAQs ("synchronous components") supported (e.g. TWR, Radio, ...)
2.a.iii. Scriptable interface via "livecmd", with options to block (wait) until run finishes (used for DAQ testing and flasher operation)
2.a.iv. Perform single run, a specified number of runs, or run indefinitely. Attempt to recover and restart runs up to eight times in case of DAQ error
2.b. Allow for control of non-DAQ components
2.b.i. Components which run independently of DAQ can be started or stopped, or reset in case of error
2.c. Enable control of "light mode" (dark / LID) via LiveView or "livecmd lightmode"
2.c.i. When light mode changes, all light-sensitive components (e.g. DAQ) are stopped and restarted for clean separation of "dark" and LID running periods
2.c.ii. Indicate light mode status via XML-RPC function, as well as via a disk file on the Experiment Control node. Transmit light mode changes to LiveView
2.d. Enable flasher operations via "livecmd flasher" command
3. Alert triggering and transmission features (alert history features described above)
3.a. LiveControl alerts
3.a.i. Provide facility for alerting when individual runs fail, and when runs repeatedly fail (i.e. the detector cannot be restarted)
3.a.ii. Allow users to define new alerts based on minimum or maximum values received from the monitoring stream
3.a.ii.1. Implement Schmitt trigger on same to prevent flapping
3.a.iii. Alerts can be configured to email one or more recipients when triggered
3.a.iv. Alerts can be configured to page the Winter-Overs
3.a.v. Persist LiveControl alert settings so that they come into effect every time LiveControl is restarted
3.a.vi. Alerts are transmitted to LiveView
3.b. Implement Subsystem Alerts
3.b.i. Individual subsystems can send alerts at will, optionally notifying recipients by email, with relevant payload data included in the alert
3.b.ii. Alerts are transmitted to LiveView
3.c. Implement Northern Data Flow Alert
3.c.i. Cron job in the North verifies data is flowing through Live and emails the current Live administrator if there is an outage
3.c.ii. When data stops flowing, clearly indicate that fact on the main "Status" page of LiveView
4. System Integration Features (features implemented in the IceCube Live core system in support of other systems; see also "Infrastructure Features", below)
4.a. Integrate with IceCube Online Database ("I3OmDb")
4.a.i. Update run_summary table with run start/stop times and DAQ event counts, continuously as data arrives from DAQ
4.a.ii. Update flasher configuration tables when flasher subruns are started
4.a.iii. Transmit SQL query statements to the North via SPADE for synchronization with the Northern database
4.b. Implement flexible plug-in architecture for control and monitoring of subsystems
4.b.i. Components can be brought in "hot" or removed at any time via the "livecmd control" and "livecmd ignore" commands
4.b.ii. Components can run on any machine inside the SPS network
4.b.iii. Components can send monitoring data to IceCube Live for alerting and/or for transmission to the LiveView Web sites
4.b.iii.1. Support basic scalar data types (int, float, str, None)
4.b.iii.2. Support structured data types (JSON)
5. Subsystem-Specific Features
5.a. pDAQ (does not include the DAQ functions described above)
5.a.i. Monitor rate deviation
5.a.i.1. Collect current event counts from pDAQ in real time via monitoring stream
5.a.i.2. Calculate fractional deviation of the current rate versus the last twenty-four hours
5.a.i.3. Alert when rate deviation exceeds 10%, except during the first 15 minutes of runs
5.a.i.4. Show alerts and graph of rate deviation on main Status page in LiveView
5.b. "SNDAQ"
5.b.i. Start/stop component via "livecmd" or LiveView
5.b.ii. "SNDAQ" page in LiveView
5.b.ii.1. Show current status
5.b.ii.2. Show recent alarms (highlight alarms less than 24 hours old in red)
5.b.ii.3. Show machine-specific performance metrics for sps-2ndbuild
5.b.ii.4. Show tabular summaries of recent indicators of SNDAQ performance
5.c. "ICL/B2" Monitoring Page
5.c.i. Show graphs of minimum and maximum temperatures as measured by ICL "weather geese"
5.c.ii. Show humidity in B2 science lab and air flow readout from Goose 2
5.c.iii. Show historical minima and maxima for all quantities
5.c.iv. Generate alert when temperatures go out of range
5.d. PnF
5.d.i. Start/stop/recover nine subsystems via "livecmd" or LiveView
5.d.ii. (Detailed PnF displays in LiveView are in progress)
5.g. I3Moni
5.g.i. Start/stop seven I3Moni components via LiveView or "livecmd"
5.g.ii. Display graphs and statistics for current disk usage on sps-i3moni
5.g.iii. (Full integration of I3Moni Web site is in progress)
5.h. Weather
5.h.i. Show graphs and current values for South Pole temperature, pressure, wind speed and direction
5.i. Meteor Radar
5.i.i. Show graphs of Meteor Radar signal strength as picked up by the DOM "Discworld", over two-hour, 24-hour and 7-day timescales
6. Infrastructure Features
6.a. Transmit data between detector/test systems and LiveView Web sites at South Pole and Madison
6.a.i. Make use of ITS (northbound & southbound), SPADE email and SCP queues (northbound only) and direct TCP/IP (local only) when available
6.a.ii. Allow messages to be sent according to 4 "priority" levels:
6.a.ii.1. 1 (ITS): ~2 minute latency
6.a.ii.2. 2 (SPADE email): ~5 minute latency
6.a.ii.3. 3 (SPADE SCP): ~0-12 hour latency
6.a.ii.4. 4 (TCP/IP direct only): immediate
6.a.iii. Higher priority messages are copied to lower priority queues for redundancy
6.a.iv. Bandwidth limits apply to each queue
6.a.iv.1. When bandwidth is saturated, the "noisiest" services lose data first
6.b. Implement hooks for custom data flows
6.b.i. "Filewatcher" program scans for incoming files of interest in the Data Warehouse (DW); can be extended to look for new file types/locations
6.b.ii. "DBServer" program receives incoming data stored in the DW or transmitted directly over TCP/IP or ITS. A handler for structured JSON data can be added for special handling of incoming data (e.g. for directly notifying a service running in Madison)
6.c. Implement direct access to LiveView database
6.c.i. Django admin panel allows administrative users (e.g. Run Coordinator) to browse and edit data in the database
6.c.ii. Direct, read-only MySQL access to most tables available to users on the UW cluster
6.d. Multi-cluster support for LiveView
6.d.i. Both SPS and SPTS clusters supported on production server in Madison
6.d.ii. "Localhost" cluster used for end-to-end testing on e.g. laptops
6.d.iii. Cluster selectable via "Comms->Settings" menu option; connection testable via ITS to South Pole or directly via TCP/IP to SPTS
6.e. "Messages" page in LiveView shows control messages sent to/from LiveView with alert, control, and run annotation information
6.f. "Stats" page in LiveView shows communications and database statistics and status of Data Warehouse files used for Live
6.g. "Chat" feature allows users to communicate via ITS if the Jabber chat room is not working
6.h. Show IceCube Live user status and history in LiveView
6.h.i. Show current users on Status page
6.h.ii. "Users" page
6.h.ii.1. Currently active users are highlighted
6.h.ii.2. "Operators" can see privileges of other users
6.h.iii. Detailed user info pages show login history for individual users
6.i. Online documentation
6.i.i. "Overview" page
6.i.ii. "Installation" how-to
6.i.iii. "Monitoring and alerts" how-to
6.i.iv. "Subsystem control" how-to
6.i.v. Developer guidelines
6.i.vi. Advanced topics
6.i.vii. Supporting infrastructure
6.i.viii. User manual (PDF) for winter-over scientists and other detector operators
6.j. Repeatable, turn-key installation procedure for Web servers (kickstart based)
6.j.i. Includes database import, software dependencies, Python 2.6 installation, security settings, Apache configuration, SSL
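The Schmitt trigger mentioned in 3.a.ii.1 is worth a sketch: a single-threshold alert "flaps" (fires repeatedly) when a monitored value hovers near the threshold, so two thresholds are used instead. A minimal illustration in Python; the class and parameter names are hypothetical, not the actual I3Live API:

```python
class SchmittTrigger:
    """Two-threshold alert trigger: fires when the value exceeds
    trip_level, and only re-arms once the value falls back below
    reset_level, preventing repeated alerts near a single threshold."""

    def __init__(self, trip_level, reset_level):
        assert reset_level < trip_level
        self.trip_level = trip_level
        self.reset_level = reset_level
        self.tripped = False

    def update(self, value):
        """Return True exactly when a new alert should be sent."""
        if not self.tripped and value > self.trip_level:
            self.tripped = True
            return True           # alert fires once
        if self.tripped and value < self.reset_level:
            self.tripped = False  # re-arm, but send no alert
        return False


# A value oscillating just above/below 10.0 fires only once,
# and fires again only after dipping below the reset level:
trig = SchmittTrigger(trip_level=10.0, reset_level=9.0)
fires = [trig.update(v) for v in [9.5, 10.1, 9.9, 10.2, 8.5, 10.5]]
# fires == [False, True, False, False, False, True]
```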

Feature List

~90 features. Full list on DocuShare: https://docushare.icecube.wisc.edu/dsweb/Get/Document-55522/

Categories:
1. Display Detector Status and History
2. Control Data Acquisition and Stand-alone Systems
3. Alert triggering and transmission features
4. System Integration Features
5. Subsystem-Specific Features
6. Infrastructure Features
7. Security Features
8. Public Outreach Features

Examples:
5.g.i Show graphs and current values for South Pole temperature, pressure, wind speed and direction
1.c.ii.3 Show live time for selected runs on 'Recent' page
7.d User logins time out after 24 hours of inactivity

8
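Feature 5.a.i (pDAQ rate-deviation monitoring) is concrete enough to sketch. This is an illustrative reading of the stated rules (10% threshold, 15-minute grace period); the function names and the representation of the 24-hour history are assumptions, not I3Live code:

```python
def rate_deviation(current_rate, rates_last_24h):
    """Fractional deviation of the current event rate from the
    24-hour mean (cf. feature 5.a.i.2)."""
    mean = sum(rates_last_24h) / len(rates_last_24h)
    return (current_rate - mean) / mean


def should_alert(current_rate, rates_last_24h, seconds_into_run,
                 threshold=0.10, grace_seconds=15 * 60):
    """Alert when |deviation| exceeds 10%, except during the first
    15 minutes of a run (cf. feature 5.a.i.3)."""
    if seconds_into_run < grace_seconds:
        return False
    return abs(rate_deviation(current_rate, rates_last_24h)) > threshold


history = [2000.0] * 24  # e.g. hourly mean rates in Hz (assumed layout)
assert not should_alert(1700.0, history, seconds_into_run=300)   # grace period
assert should_alert(1700.0, history, seconds_into_run=3600)      # 15% low
assert not should_alert(2100.0, history, seconds_into_run=3600)  # only 5% high
```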

Page 9

IceCube Live

Design

Page 10

IceCube Live

Design - Overview

10

LiveControl: control point for subsystems
• Simple interfaces for control, monitoring and logging
• Generates/transmits alerts

LiveView: Web-based UI
• Transmits control requests to LiveControl
• Shows current state and history

LiveCmd: command-line interface
• For direct access/special operations over SSH

[Diagram: operators interact through LiveView (status, control, alerts) or LiveCmd; both connect to LiveControl, which manages DAQ, TWR, PnF, GRB, SN, SPS and other subsystems.]

Page 11

IceCube Live

System Diagram

[Diagram: at SPS, LiveControl on sps-expcont controls pDAQ and the other subsystems and feeds the SPS Web server (DBServer, MySQL + Django ORM, LiveView, disk cache), which serves the B2 Science lab and the Winter-Overs. Alerts and data flow north via ITS and email/SPADE to the Data Warehouse, where a Filewatcher and DBServer feed the NH Web server (MySQL + Django ORM, LiveView, disk cache), which serves the Collaboration.]

11

Page 12

IceCube Live

Hardware

12

Where       Names                               Function                           Model        CPU                  Cores  RAM (GB)  Disk (GB)
SPS         sps-expcont*                        LiveControl, pDAQ, DB scripts, ... HP DL385/G1  AMD Opteron 2.4 GHz  2x2    8         300
SPS         sps-i3live (live.icecube.usap.gov)  LiveView                           HP DL380/G5  Intel Xeon 3 GHz     2x4    16        400
222 Cygnus  live.icecube.wisc.edu               LiveView                           HP DL385/G1  AMD Opteron 2.4 GHz  2x4    8         300

Hardware configurations to change in '10-'11!

(*) Duplicated at SPTS

Page 13

IceCube Live

LiveView Implementation Stack

13

Your Browser
jQuery JavaScript Library
Apache Web server + mod_wsgi
Django Web Framework
Python (MySQLdb, ... other Python libraries ...)
MySQL
Linux

(LiveControl and LiveCmd: 100% pure Python.)

Page 14

IceCube Live

System Dependencies

14

[Dependency diagram: LiveControl, LiveView and LiveCmd depend, firmly or softly, on ITS, SPADE, the Data Warehouse, LDAP, MySQL, Python 2.3 (2.6), virtualenv, distutils, textile, simplejson, MySQLdb, Django, jQuery, I3OmDb, email, Apache, mod_wsgi, NFS (perot, for SPTS), Web browsers, and the subsystems pDAQ, PnF, SNDAQ, ...]

14

Page 15

IceCube Live

Plug-in Integration of Sub-systems

[Diagram: subsystems such as pDAQ, SNDAQ, PnF, ITS, SPADE, Meteorology and ICL Temperatures plug into IceCube Live through common interfaces for control, logging, monitoring and alerts, plus optional custom display(s).]

15
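The monitoring half of the plug-in interface (features 4.b.iii.1-2: scalar values plus structured JSON) can be sketched as a record a subsystem might hand to IceCube Live. All field names here are illustrative assumptions, not the actual wire format:

```python
import json
import time


def make_moni_message(service, varname, value, prio=3):
    """Sketch of a monitoring record: a service name, a variable name,
    a timestamp, a priority, and a scalar or JSON-serializable value.
    Supported types mirror features 4.b.iii.1-2 (int, float, str,
    None, plus structured dict/list data)."""
    if not isinstance(value, (int, float, str, type(None), dict, list)):
        raise TypeError("unsupported monitoring value type: %r" % type(value))
    return json.dumps({
        "service": service,
        "varname": varname,
        "prio": prio,
        "time": time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime()),
        "value": value,
    })


# A scalar quantity and a structured one, from a hypothetical subsystem:
msg1 = make_moni_message("sndaq", "alarm_rate", 0.25)
msg2 = make_moni_message("icl", "goose_temps", {"goose1": -40.5, "goose2": -38.9})
```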

Page 16

IceCube Live

Status

Page 17

IceCube Live

Current Status

• Production use started April 1, 2009 (end of IC40)
• 19 controllable services/components
• 27 services sending monitoring data
• Typical latencies ~5 min.
• ~100k quantities transmitted/stored per day
• ~12 users per (week)day
• 181 users to date
• 99.7% uptime of core control system

17

Page 18

IceCube Live

Code

Page 19

IceCube Live

Code Style Goals

- Modularity: reduce cross-dependencies
- Testability: use doctests; see testing slide
- Readability: use PEP-8 for new code & fixes
- Releasability: only commit "working" code
- Conciseness: KISS & DRY (maintainability)

See for yourself at http://code.icecube.wisc.edu/svn/projects/live/
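The "Testability: use doctests" goal means that usage examples embedded in docstrings double as tests. A minimal sketch of the style; the helper function itself is hypothetical, not actual I3Live code:

```python
def uptime_fraction(up_seconds, total_seconds):
    """Fraction of time the detector was up, as might back an
    'Uptime' page figure (illustrative helper).

    The examples below are executed verbatim by the doctest module:

    >>> uptime_fraction(997, 1000)
    0.997
    >>> uptime_fraction(0, 1000)
    0.0
    """
    return up_seconds / float(total_seconds)


if __name__ == "__main__":
    import doctest
    # Runs every >>> example above and reports any mismatch.
    doctest.testmod()
```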

19

Page 20

IceCube Live

A Hierarchy of Testing:

Level               How                          When
Single module       Python 'doctests'            continuously (TDD)
Models and URLs     Django doctests              before commit
Simple integration  'Livetests'                  before commit
End-to-end          Laptop simulation + browser  as needed
Qualification       SPTS                         before release
Production          SPS                          after deployment

For highest modularity, confidence and ease of debugging, ensure correctness at the lowest possible level.

20

Page 21

IceCube Live

Release discipline
- Revision tracking in SVN
- Issue tracking in Mantis
  • "Issue" = well-defined & tracked change to the code
  • 1 feature <=> 1-10 issues
  • 1 bug fix <=> 1-2 issues
- Roughly monthly releases
  • 18 so far (initial: v0.9.1; latest: v1.4.0)
  • As features/fixes are needed at SPS
  • As resolved issues pile up in the codebase
  • See RELEASE_NOTES file in the source tree

21

Page 22

IceCube Live

Documentation

[Mock book cover: "IceCube Live: Experiment Control. Control and monitor your gigaton Antarctic neutrino detector, in near-real-time! By the IceCube Collaboration, R. Abbasi et al."]

Page 23

IceCube Live

IceCube Live online documentation*

https://live.icecube.wisc.edu/doc/

- Beginner and advanced sections
- Developer how-to's and guidelines
- PDF Operator's Document for WOs, experts
- Separate Wiki pages for IT staff for kickstarting Web servers
- Documentation for non-operations physics users somewhat lacking?

(*) Documentation stored in SVN with I3Live source code

23

Page 24

IceCube Live

Security (see also Paul W. talk)

Assets/steps:
• Django security (autoescaping, XSS protection, SQL sanitizing, ...) (see http://www.djangobook.com/en/2.0/chapter20/)
• LDAP/PAM authentication of users
• HTTPS for all traffic
• Use 'pdaq' group to define operator privileges
• Limit connection between LiveView & LiveControl
• Periodic vulnerability scans by UW and USAP
• Keep system logs for > 5 months

Results:
• Roughly 2 automated attacks/week (since Mar. '10)
• No successful intrusions detected since first prototype brought online, April 2008.

24
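Two of the listed protections (autoescaping against XSS, and SQL sanitizing) share one principle: never splice untrusted input directly into markup or queries. Django's template engine and ORM handle both automatically; here is a framework-free sketch of the same idea using only stdlib modules:

```python
import html
import sqlite3

# XSS: escape untrusted input before it reaches HTML.
# (Django's template autoescaping does this for every variable.)
comment = '<script>alert("pwned")</script>'
safe = html.escape(comment)
# The escaped form renders as inert text, not an executable script.

# SQL injection: use parameterized queries, never string formatting.
# (Django's ORM parameterizes all queries for you.)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE runs (num INTEGER, comment TEXT)")
db.execute("INSERT INTO runs VALUES (?, ?)", (115000, comment))
rows = db.execute("SELECT comment FROM runs WHERE num = ?", (115000,)).fetchall()
# The hostile string is stored and retrieved verbatim, but it never
# becomes part of the SQL statement itself.
```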

Page 25

IceCube Live

LiveView User Hierarchy

1. unauthenticated user (redirected to login page)
2. 'icecube' user (cannot add run comments)
3. 'normal' user
4. 'operator' (pdaq group): can start/stop things and delete run comments
5. 'Django superuser': can use the Django admin panel to edit the LiveView database directly

25
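The tier logic above can be sketched as simple predicates. This is an illustration only; the real checks live in LiveView's Django views and are backed by LDAP groups:

```python
from collections import namedtuple

# Minimal stand-in for a user record; field names are illustrative.
User = namedtuple("User", ["name", "groups"])


def can_add_comment(user):
    # Everyone except the shared 'icecube' account may add run comments.
    return user.name != "icecube"


def can_delete_comment(user):
    # Only operators (members of the 'pdaq' group) may delete comments.
    return "pdaq" in user.groups


winterover = User("winterover1", {"pdaq"})
shared = User("icecube", set())
normal = User("physicist", set())
assert can_delete_comment(winterover) and not can_delete_comment(normal)
assert can_add_comment(normal) and not can_add_comment(shared)
```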

Page 26

IceCube Live

Data Accumulation (not including DW files)

26

[Table of database table sizes, April 2008 - July 2010; includes SPS & SPTS.]

Pruned and archived manually every few months (see https://live.icecube.wisc.edu/doc/infrastructure/).

Assume a 10-fold increase in table sizes over the lifetime of the project (except moni/log). Since a 10-fold increase in the moni tables alone does not currently bring the system to its knees, I believe we are OK. Further pruning of e.g. the Message or WarehouseFile tables should be possible.

Page 27

IceCube Live

Review History

System Review: you're in it!

Code Reviews:
• Christopher Webber, a Chicago Django expert, reviewed the LiveView code.

Suggestions:
• Adhere to the PEP-8 standard (adopted for new code)
• Use the South migration tool when changing DB tables (now done)
• Other aesthetic/organizational suggestions not yet implemented.

27

Page 28

IceCube Live

Subsystem Integration Status

pDAQ                Dave G., Kael           Working
SN-DAQ              Timo, JJ                Working
ICL, B2 monitoring  JJ                      Working
Weather             Victor                  Working
Meteor Radar        JJ/Dave                 Working
ITS                 JJ, Victor, ...         Working
PnF                 Torsten & Erik          Needs display, filter rates
SPADE               Victor                  Partial
Verification        James Pepper            In progress
I3Moni              Kai Schatto             In progress
Optical Follow-Up   Anna, Torsten, JJ       Started
Nagios              IT group, Victor, ...?  Needs decision

28

Page 29

IceCube Live

Room for improvement (see also Sebastian Böser's talk)

Aesthetics - have not been a priority and could be improved

DAQ Error Handling - DAQ-side errors are fairly opaque and confusing for operators

Testability - there are many, many tests, but many core features, particularly in the Web UI, do not have automated tests
• Python test code: 2,548 lines
• Overall Python: 19,891 lines (a 13% ratio; ideally closer to 50%)

Graphing - the graphs could be more complex, more informative, more attractive, more readable (in progress)

29

Page 30

IceCube Live

Room for improvement, continued

Speed - some pages load slowly

Issue backlog - some 160+ issues remain in Mantis queue

....

Expertise & Manpower - this system, critical to detector performance and uptime, is used every day by the IceCube Collaboration, yet only one person has ‘enough’ expertise

30

Page 31

IceCube Live

Effort

Page 32

IceCube Live

Steady-State (hu)Manpower

Current level of effort:
• 60% Jacobsen (also contributing to Operations, DAQ, ...)
• 25% Bittorf
• ???% Frère
• + subsystem integration work (Schatto, Pepper)

The current rate of implementation of new features is slow (requests arrive faster than they can be satisfied) -> manage carefully (see Sebastian's talk)

Best guess for long-term maintenance: at least 0.5 FTE (includes occasional feature additions based on science requirements and user requests).

32

Page 33

IceCube Live

Conclusion

Page 34

IceCube Live

Conclusion

• The core IceCube Live system is implemented and has been in production since April 2009.
• Uptime of the control system is 99.7%.
• The system meets the minimum requirements for operating and monitoring the detector.
• Most of the key subsystems have been fully or partially integrated.
• There is opportunity for improvement in both the core system and in subsystem integration.
• IceCube Live is a greatly improved Experiment Control which cost the project roughly 1/4 of what previous versions cost.

34

Page 35

IceCube Live

Spares

Page 36

IceCube Live

Why so many Experiment Control systems?

• 2005: Initial attempt was ambitious but did not achieve its goals in time (IC9 startup)
• 2006: Quick-and-dirty "pirate" implementation for rDAQ/IC9
• 2007: Third implementation (IC22, IC40) functioned OK for the most part, but:
  - Status displays were meagre
  - Only 2-3 systems integrated
  - Flasher implementation flawed
  - Robustness and maintenance issues
• The good news: each implementation built on previous experience.

36

Page 37

IceCube Live

Steering Development

Multiple phases of development:

37

[Timeline figure, 2007-2010: vision document; initial prototyping; "IceCube Live Advisory Group"; iterative testing/development at Pole; production start (IC59); continued, iterative improvements, with features accumulating throughout.]

Page 38

IceCube Live

LiveControl Block Diagram

38

[Block diagram: LiveControl threads include initialization/monitoring & alert dispatch, a DAQ runner, an idle DAQ monitor, a monitoring listener, an alert trigger, a DOMHub monitor interface, a LID state transition handler, and a message handler (serving LiveCmd). They connect the DAQ, the domhub monitor (via a status file), a local alert DB, I3OmDb and other subsystems to ITS and the SPADE dropbox, feeding LiveView ("Recent", "Rates") and the Winter-Overs, with control state and data also flowing North.]

Page 39

IceCube Live

Connecting LiveView <-> LiveControl

Modes of connecting LiveView (View) to LiveControl (Control), with the transport hops in between (reconstructed from the connection diagram; host:port pairs as labeled):

  Mode                      View                        Transport                              Control
  ITS (Default NH)          live.icecube.wisc.edu:7000  its.icecube.wisc.edu:10801 <->         sps-expcont:7001
                                                        its.icecube.southpole.usap.gov:10878
  ITS-A (SPTS)              live.icecube.wisc.edu:7000  its-a.icecube.wisc.edu:10801 <->       spts64-expcont:7001
                                                        its-b.icecube.wisc.edu:10878
  SPS-Direct (Default NPX)  gull:7000                   direct TCP/IP                          sps-expcont:7001
  SPTS-Direct               live.icecube.wisc.edu:7000  direct TCP/IP                          spts64-expcont:7001
  Localhost                 localhost:7000              direct TCP/IP                          localhost:7001
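A connection-mode scheme like the one diagrammed above lends itself to a small lookup table. The sketch below is purely illustrative (the dataclass, field names, and config structure are invented, not I3Live's actual configuration; host pairings follow the reconstruction above):

```python
# Hypothetical representation of LiveView <-> LiveControl connection modes as
# a lookup table. Structure and names are illustrative only, not I3Live's.
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class ConnectionMode:
    view: Tuple[str, int]                     # LiveView host, port
    control: Tuple[str, int]                  # LiveControl host, port
    relays: Tuple[Tuple[str, int], ...] = ()  # ITS relay hops; empty = direct

MODES = {
    "ITS (Default NH)": ConnectionMode(
        view=("live.icecube.wisc.edu", 7000),
        control=("sps-expcont", 7001),
        relays=(("its.icecube.wisc.edu", 10801),
                ("its.icecube.southpole.usap.gov", 10878)),
    ),
    "SPTS-Direct": ConnectionMode(
        view=("live.icecube.wisc.edu", 7000),
        control=("spts64-expcont", 7001),
    ),
    "Localhost": ConnectionMode(
        view=("localhost", 7000),
        control=("localhost", 7001),
    ),
}

def is_direct(mode: ConnectionMode) -> bool:
    """A mode with no ITS relay hops speaks TCP/IP straight to LiveControl."""
    return not mode.relays
```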


Data Transport layer — Priority Queues

  Priority  Name (Transport)       Latency    Max. Data Rate
  1         ITS                    ~2 min     20 B/sec
  2         SPADE email (Iridium)  ~5 min     1 kB/sec
  3         SPADE SCP (TDRSS)      0-12 hrs   2.3 kB/sec
  4         Direct (TCP/IP)        < 1 sec    > 1 MB/sec

Features:
- Priority 1 is duplicated in streams P2 & P3; P2 is duplicated in P3
- Max. data rates are respected -- “noisy” subsystems are discarded first if need be
- Transport statistics are monitored on the ‘Stats’ page

See proposed requirements at https://docushare.icecube.wisc.edu/dsweb/Get/Document-48537/
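The two queue rules above (duplicating higher-priority streams downward, and shedding the noisiest subsystems to respect a rate cap) can be sketched as follows. This is an illustrative model, not I3Live's implementation; function names, message fields, and the byte budgets are assumptions loosely based on the table:

```python
# Illustrative sketch (not I3Live code) of the transport-layer rules:
# P1 is duplicated into P2 & P3, P2 into P3; each transport has a byte
# budget, enforced by discarding whole "noisy" subsystems, biggest first.
from collections import defaultdict

# Hypothetical per-transport byte budgets, loosely following the table (B/sec).
BUDGETS = {1: 20, 2: 1_000, 3: 2_300}

def route(messages):
    """Duplicate each message into its own stream and all lower-priority
    queued streams: P1 -> 1,2,3; P2 -> 2,3; P3 -> 3.
    (Priority 4, Direct TCP/IP, bypasses the queued transports.)"""
    streams = defaultdict(list)
    for msg in messages:
        for transport in range(msg["priority"], 4):
            streams[transport].append(msg)
    return streams

def enforce_budget(stream, budget):
    """Drop messages from the chattiest subsystems until the stream fits."""
    by_subsystem = defaultdict(int)
    for msg in stream:
        by_subsystem[msg["subsystem"]] += msg["bytes"]
    kept = list(stream)
    # Discard whole subsystems, noisiest first, while over budget.
    for subsystem, _ in sorted(by_subsystem.items(), key=lambda kv: -kv[1]):
        if sum(m["bytes"] for m in kept) <= budget:
            break
        kept = [m for m in kept if m["subsystem"] != subsystem]
    return kept
```

Discarding per-subsystem (rather than per-message) matches the stated policy: a single chatty subsystem cannot starve everyone else's bandwidth.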


Example LiveCmd Operations on sps-expcont*

Check out code (e.g. on sps-expcont):
  $ svn co http://code.icecube.wisc.edu/svn/projects/live/trunk live

Install code:
  $ cd live && ./setup.py install

(Re-)start LiveControl:
  $ livecmd launch

Start DAQ (default configuration):
  $ cd && ./starti3

Stop DAQ:
  $ livecmd stop daq

Start an individual component:
  $ livecmd start sndaq

* see also the Operator Documentation PDF


The Livetimes of Live

Different kinds of uptime:

• LiveControl core system: crashes on 7/30/09 and 8/5/09 (root cause fixed);
  ~1 day total downtime, for an uptime of 99.7%.
• Data slowdowns (SPADE, ITS, I3Live issues): 93% “prompt” uptime
  (June 2009 - July 2010), based on my email ‘forensics’.
• LiveView web site (hostage to DW, LDAP, ...): uptime unknown;
  estimated > 90%.
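The 99.7% figure follows directly from roughly one day of downtime over a one-year run, which can be checked in one line (the one-year window is an assumption consistent with the June 2009 - July 2010 reporting period above):

```python
# Sanity check of the quoted LiveControl uptime: ~1 day down out of ~365.
down_days = 1.0
year_days = 365.0
uptime_pct = 100.0 * (1 - down_days / year_days)
print(f"{uptime_pct:.1f}%")  # -> 99.7%
```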


Contributors

~28k lines of code committed to trunk:

  24330  jacobsen  (JJ)
   1998  vbittorf  (Victor)
   1498  enielsen  (Tex)
    499  mfrere    (Michael Frère)
     37  dglo      (Dave G.)
      3  jpepper   (James Pepper, UA)
  -----
  28365  TOTAL ... only!

649 Mantis issues completed.

[Pie charts of contribution shares: jacobsen 87% and jacobsen 96%, with smaller slices for michael, victor, and tex.]


Releases and Resolved Mantis Issues

[Chart: cumulative resolved Mantis issues (y-axis 0-600) from March 2009 to June 2010, annotated with releases and deployments: release 0.9.1 (3/26/09), IC59 start (5/20/09), IC79 start (5/31/10), release 1.4.0 (6/14/10).]


Genesis (including design docs & reports)

• Discussion with Albrecht, 12/2007
• “Vision” document, 12/07:
  https://docushare.icecube.wisc.edu/dsweb/Get/Document-45721/
• Prototyping and discussions at Pole, 1/08
• Preliminary design document:
  https://docushare.icecube.wisc.edu/dsweb/Get/Document-45723
• Talks at Tuesday Call and Collab. Mtg., 3-4/08:
  https://docushare.icecube.wisc.edu/dsweb/Get/Document-45839
  https://docushare.icecube.wisc.edu/dsweb/View/Collection-6147
• Development phase, 2008; Advisory Panel: Azriel, Martin, Erik, Timo, Kael
• SCAP 2009 status report:
  https://docushare.icecube.wisc.edu/dsweb/Get/Document-48979/


Genesis, Contd.

• IC59 run (2009); Anvil to I3Live transition, 4/1/09
• Status document circulated to the Collaboration, 9/09:
  https://docushare.icecube.wisc.edu/dsweb/Get/Document-51130/
• Roundtable discussion, Berlin, 9/09:
  https://docushare.icecube.wisc.edu/dsweb/Get/Document-51292/
• First draft of new feature schedule, Denise/JJ, 12/09:
  http://docs.google.com/View?id=dgs8xkph_12fh9b82d4
• Roundtable discussion in Annapolis, May 2010
