Automatic Outage Planning at eBay
November 4, 20145
Kevin Isaacson eBay Batch Tools and Infrastructure
3 Property of Automic Software. All rights reserved
Environment and background
• eBay Marketplaces Site environment
– Jobs with direct impact on the site, billing, customer emails
– ~1000 agents, 120,000 executions per day
– 1090 site databases
• Reasons for Outage Tool
– Weekly maintenance window Thursday evenings, multiple databases taken down
– Job dependencies not provided by PD
– Job volume makes it impossible to manually plan around maintenance
– Large maintenance can cause hundreds of job aborts
Goals
• Solve two root problems
– How do we determine which jobs connect to which databases?
– How do we have those jobs avoid outages without manual intervention?
• Design system using Automic internal features wherever possible
Outage Tool Architecture
• Two main components
– Dependency scanner – Finds external dependencies (sqlnet, ftp, http connections)
– Outage prefix script – Uses dependency data to prevent job aborts
• All data stored in Automic VARA objects and archive keys
– Can be viewed within GUI without additional tools
– No additional tables required
Data Population
[Diagram: JOBS.DEP_SCAN on each AGENT; VARA.DEP.WORKFLOW.JOB; VARA.NEXT_OUTAGE; JOB with Archive Key OB=DELAY and Post Process tab :INC POST_OUTAGE_RESTART]
Dependency Scanner: Overview
• Perl script
• Scheduled every 303 minutes on agent group with all agents
• Uses OS commands to gather open connections and tie them to job objects
– ps command to gather list of running Automic jobs
  • Convert name of temp file to base 10 job runid
– pstree command gets child processes
– pfiles (Solaris) or lsof (Linux) to get list of external connections
– Dependency data written to stdout. Example:
DEP=187270226:caty2phx8.vip.ebay.com:1521
DEP=187332422:phxuc4app02.phx.ebay.com:2221
DEP=187332422:cal.vip.phx.ebay.com:1118
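As an illustration only, the scanner output above could be turned into structured records along these lines. This is a minimal Python sketch, not the Perl code from the talk; the names `Dependency` and `parse_dep_lines` are hypothetical, and the sketch assumes every record has the form `DEP=<runid>:<host>:<port>` on its own line.

```python
from typing import List, NamedTuple


class Dependency(NamedTuple):
    """One external connection observed for a running job (hypothetical type)."""
    runid: int
    host: str
    port: int


def parse_dep_lines(text: str) -> List[Dependency]:
    """Extract DEP=<runid>:<host>:<port> records, ignoring other output."""
    deps = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("DEP="):
            continue  # anything else in the job report is not dependency data
        body = line[len("DEP="):]
        # The first colon ends the runid, the last colon starts the port.
        runid_str, _, rest = body.partition(":")
        host, _, port_str = rest.rpartition(":")
        if not host:
            continue  # malformed record: fewer than three fields
        try:
            deps.append(Dependency(int(runid_str), host, int(port_str)))
        except ValueError:
            continue  # non-numeric runid or port
    return deps
```

Malformed lines are skipped rather than raised, on the assumption that a scanner report may contain unrelated output.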
• Harvester job parses output from RT table, looks up job and workflow names, creates Automation Engine script files
• Executes CallAPI on the script files to update VARA objects:
:PUT_VAR VARA.DEP.CIMS_RADAR_ASSERTION_SUB.V3_CIMS_RADAR_ASSERTION_BATCH_3, "tns.vip.ebay.com", "1521", "Mon Jan 23 16:52:19 2012"
:PUT_VAR VARA.DEP.RADAR_ARCHIVE_EVENTS.V3_RADAR_ARCHIVE_EVENTS_30, "storageservice.vip.slc.ebay.com", "80", "Mon Jan 23 16:52:20 2012"
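The script-generation step might look roughly like the following Python sketch. It only mirrors the line format shown in the slide example; the function name `put_var_line` is hypothetical, the workflow and job names are assumed to have been looked up already, and the real harvester may build its script files differently.

```python
from datetime import datetime


def put_var_line(workflow: str, job: str, host: str, port: int,
                 seen_at: datetime) -> str:
    """Format one :PUT_VAR statement in the shape shown on the slide."""
    # Timestamp layout copied from the slide example, e.g. "Mon Jan 23 16:52:19 2012".
    stamp = seen_at.strftime("%a %b %d %H:%M:%S %Y")
    return f':PUT_VAR VARA.DEP.{workflow}.{job}, "{host}", "{port}", "{stamp}"'


if __name__ == "__main__":
    print(put_var_line("RADAR_ARCHIVE_EVENTS", "V3_RADAR_ARCHIVE_EVENTS_30",
                       "storageservice.vip.slc.ebay.com", 80,
                       datetime(2012, 1, 23, 16, 52, 20)))
```

One statement per observed connection keeps each VARA update independent, so a single bad record does not block the others.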
Dependency Variable Example
VARA.DEP.WORKFLOW.JOB
Outage Prefix Script
• Called by \HEADER\HEADER.UNIX.USER.PRE in client 0
• Runs before every job
[Flowchart]
1. Is a current or future outage defined? N: Exit. Y: continue.
2. Per my ERT, will I be affected? N: Exit. Y: continue.
3. Are any resources in the outage also in my VARA? N: Exit. Y: Perform defined behavior.
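The three checks in the flowchart can be sketched in Python as below. This is a hedged illustration, not the actual prefix script: the `Outage` type and `affected_by_outage` function are hypothetical, ERT is assumed to mean the job's estimated runtime, and "affected" is assumed to mean that the interval from now to now plus ERT overlaps the outage window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Set


@dataclass
class Outage:
    """Hypothetical stand-in for the contents of VARA.NEXT_OUTAGE."""
    start: datetime
    end: datetime
    resources: Set[str]  # hosts taken down during the window


def affected_by_outage(now: datetime, ert: timedelta,
                       outage: Optional[Outage], job_deps: Set[str]) -> bool:
    """Return True only if all three flowchart checks answer yes."""
    # 1. Is a current or future outage defined?
    if outage is None or outage.end <= now:
        return False
    # 2. Per the ERT, would the run overlap the window? (assumed meaning)
    if now + ert <= outage.start:
        return False
    # 3. Are any resources in the outage also in the job's dependency VARA?
    return bool(outage.resources & job_deps)
```

Ordering the checks from cheapest to most specific lets most jobs exit after the first comparison, which matters for a script that runs before every job.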
VARA.NEXT_OUTAGE
Supported Outage Behaviors
• Put the following in either of the Archive Key fields (on General tab) for the JOB.
• OB=[behavior value];
– DELAY - Job will delay until the end of the outage window (it is not necessary to "delete waiting jobs" because each queued job evaluates individually whether it should start or skip)
– SKIP - The job will skip any runs that would extend into the outage window. We decided that DELAY is almost always better.
– RESTART (not implemented yet) - If the job aborts, change exit code to 0 and message to "MAINT_RESTARTED" and restart it after the window.
  • Must include the command ":INCLUDE INC.OUTAGE.POSTPROCESS" on the post-processing tab of the job
– IGNORE - If the job aborts, change exit code to 0 and message to "MAINT_IGNORED"
  • Must include the command ":INCLUDE INC.OUTAGE.POSTPROCESS" on the post-processing tab of the job
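Reading the behavior out of an archive key could be sketched as follows. This Python example is an assumption-laden illustration, not the tool's code: `outage_behavior` and `SUPPORTED_BEHAVIORS` are hypothetical names, the trailing semicolon is treated as optional, and upper-casing plus rejecting unknown values are choices made here that the original tool is not stated to make.

```python
import re
from typing import Optional

# Behavior names taken from the list above.
SUPPORTED_BEHAVIORS = {"DELAY", "SKIP", "RESTART", "IGNORE"}

# \b keeps a key such as "JOB=..." from being mistaken for "OB=...".
_OB_PATTERN = re.compile(r"\bOB=([^;\s]*)")


def outage_behavior(archive_key: str) -> Optional[str]:
    """Return the behavior named in an archive key, or None if no OB= is present."""
    match = _OB_PATTERN.search(archive_key)
    if match is None:
        return None
    value = match.group(1).upper()  # case-insensitivity is an assumption
    if value not in SUPPORTED_BEHAVIORS:
        raise ValueError(f"unsupported outage behavior: {value!r}")
    return value


if __name__ == "__main__":
    print(outage_behavior("OB=DELAY;"))
```

Failing loudly on a mistyped value is one way to surface manual-entry errors before a maintenance window rather than during it.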
Weaknesses & Improvements
• Tough to catch connections for short-running jobs
• Takes days or weeks to build full dependency data for a new job
• Overhead for prefix script on every job
• No input validation for manually edited items (OB in archive key, VARA.NEXT_OUTAGE settings)
• Dependencies apply only to jobs; would be nice to apply outage behavior to a whole workflow