
How eBay does Automatic Outage Planning


Automatic Outage Planning at eBay

November 4, 20145

Kevin Isaacson eBay Batch Tools and Infrastructure


3 Property of Automic Software. All rights reserved

Environment and background

• eBay Marketplaces Site environment
– Jobs with direct impact on the site, billing, customer emails
– ~1000 agents, 120,000 executions per day
– 1090 site databases

• Reasons for Outage Tool
– Weekly maintenance window Thursday evenings, multiple databases taken down
– Job dependencies not provided by PD
– Job volume makes it impossible to manually plan around maintenance
– Large maintenance can cause hundreds of job aborts


Goals

• Solve two root problems
– How do we determine which jobs connect to which databases?
– How do we have those jobs avoid outages without manual intervention?

• Design system using Automic internal features wherever possible


Outage Tool Architecture

• Two main components
– Dependency scanner – Finds external dependencies (sqlnet, ftp, http connections)
– Outage prefix script – Uses dependency data to prevent job aborts

• All data stored in Automic VARA objects and archive keys
– Can be viewed within GUI without additional tools
– No additional tables required


Data Population

[Diagram. Labels: VARA.NEXT_OUTAGE; three AGENT boxes, each with JOBS.DEP_SCAN; VARA.DEP.WORKFLOW.JOB; JOB with Archive Key OB=DELAY and Post Process tab: INC POST_OUTAGE_RESTART]


Dependency Scanner: Overview

• Perl script
• Scheduled every 303 minutes on agent group with all agents
• Uses OS commands to gather open connections and tie them to job objects
– ps command to gather list of running Automic jobs
  • Convert name of temp file to base 10 job runid
– pstree command gets child processes
– pfiles (Solaris) or lsof (Linux) to get list of external connections
– Dependency data written to stdout. Example:

DEP=187270226:caty2phx8.vip.ebay.com:1521
DEP=187332422:phxuc4app02.phx.ebay.com:2221
DEP=187332422:cal.vip.phx.ebay.com:1118
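To make the output format concrete, here is a minimal sketch of how a consumer might group the DEP= lines by run ID. It is written in Python for illustration only (the scanner itself is a Perl script), and the function name parse_dep_lines is hypothetical, not part of the tool. It assumes each record has the shape DEP=<runid>:<host>:<port> seen in the example.

```python
from collections import defaultdict


def parse_dep_lines(text):
    """Group 'DEP=<runid>:<host>:<port>' records by run ID.

    Hypothetical helper: the record layout is inferred from the sample
    output on the slide, not from the real scanner source.
    """
    deps = defaultdict(set)
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("DEP="):
            continue  # ignore anything that is not a dependency record
        body = line[len("DEP="):]
        runid, rest = body.split(":", 1)   # first field is the run ID
        host, port = rest.rsplit(":", 1)   # last field is the port
        deps[int(runid)].add((host, int(port)))
    return dict(deps)


sample = """\
DEP=187270226:caty2phx8.vip.ebay.com:1521
DEP=187332422:phxuc4app02.phx.ebay.com:2221
DEP=187332422:cal.vip.phx.ebay.com:1118
"""

deps = parse_dep_lines(sample)
for runid in sorted(deps):
    print(runid, sorted(deps[runid]))
```

Using a set per run ID means a connection reported by several scans is stored once.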


• Harvester job parses output from RT table, looks up job and workflow names, creates Automation Engine script files
• Executes CallAPI on the script files to update VARA objects:

:PUT_VAR VARA.DEP.CIMS_RADAR_ASSERTION_SUB.V3_CIMS_RADAR_ASSERTION_BATCH_3, "tns.vip.ebay.com", "1521", "Mon Jan 23 16:52:19 2012"
:PUT_VAR VARA.DEP.RADAR_ARCHIVE_EVENTS.V3_RADAR_ARCHIVE_EVENTS_30, "storageservice.vip.slc.ebay.com", "80", "Mon Jan 23 16:52:20 2012"
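As an illustration of the script-generation step, the sketch below formats one :PUT_VAR statement in the shape shown above: VARA name VARA.DEP.<workflow>.<job>, then host, port, and a timestamp. The function name put_var_line is hypothetical. How the harvester resolves a run ID to workflow and job names, and how CallAPI is invoked, are not shown in the slides and are not reproduced here.

```python
from datetime import datetime


def put_var_line(workflow, job, host, port, seen_at):
    """Format one :PUT_VAR statement like the examples on the slide.

    Hypothetical helper; the argument order (host, port, timestamp)
    is copied from the slide's sample statements.
    """
    stamp = seen_at.strftime("%a %b %d %H:%M:%S %Y")
    return f':PUT_VAR VARA.DEP.{workflow}.{job}, "{host}", "{port}", "{stamp}"'


line = put_var_line(
    "CIMS_RADAR_ASSERTION_SUB",
    "V3_CIMS_RADAR_ASSERTION_BATCH_3",
    "tns.vip.ebay.com",
    1521,
    datetime(2012, 1, 23, 16, 52, 19),
)
print(line)
```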


Dependency Variable Example

VARA.DEP.WORKFLOW.JOB


Outage Prefix Script

• Called by \HEADER\HEADER.UNIX.USER.PRE in client 0
• Runs before every job

[Flowchart: three checks in sequence]
• Is a current or future outage defined? N: Exit. Y: next check
• Per my ERT, will I be affected? N: Exit. Y: next check
• Are any resources in the outage also in my VARA? N: Exit. Y: Perform defined behavior
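The three checks can be sketched as a single function. This is an illustrative Python model, not the actual prefix script, which runs inside the Automation Engine. The Outage class, the outage_action function, and the choice to model the ERT as a timedelta are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Outage:
    """Hypothetical stand-in for the data held in VARA.NEXT_OUTAGE."""
    start: datetime
    end: datetime
    resources: frozenset  # hosts taken down during the window


def outage_action(now, ert, job_deps, outage, behavior):
    """Return the behavior to perform, or None if the job runs normally."""
    # Check 1: is a current or future outage defined?
    if outage is None or outage.end <= now:
        return None
    # Check 2: per the ERT, would this run reach into the window?
    if now + ert <= outage.start:
        return None
    # Check 3: are any outage resources also in the job's dependency VARA?
    if not (outage.resources & job_deps):
        return None
    return behavior


window = Outage(
    start=datetime(2024, 1, 4, 20, 0),
    end=datetime(2024, 1, 4, 23, 0),
    resources=frozenset({"db1.example.com"}),
)
now = datetime(2024, 1, 4, 19, 30)
print(outage_action(now, timedelta(hours=1), {"db1.example.com"}, window, "DELAY"))
```

Ordering the checks from cheapest to most specific keeps the common case, no outage defined, to a single comparison.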


VARA.NEXT_OUTAGE


Supported Outage Behaviors

• Put the following in either of the Archive Key fields (on General tab) for the JOB.
• OB=[behavior value];
– DELAY - Job will delay until the end of the outage window (it is not necessary to "delete waiting jobs" because the queued jobs will all evaluate individually whether they should start or skip)
– SKIP - The job will skip any runs that would extend into the outage window. We decided that DELAY is almost always better.
– RESTART (not implemented yet) - If it aborts, change exit code to 0 and message to "MAINT_RESTARTED" and restart it after the window.
  • Must include the command ":INCLUDE INC.OUTAGE.POSTPROCESS" on the post-processing tab of the job
– IGNORE - If it aborts, change exit code to 0 and message to "MAINT_IGNORED"
  • Must include the command ":INCLUDE INC.OUTAGE.POSTPROCESS" on the post-processing tab of the job
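To illustrate the OB=[behavior value]; convention, the following sketch extracts the behavior from the two Archive Key fields. It is a hypothetical Python helper, not the tool's own code. Treating the trailing semicolon as optional, upper-casing the value, and rejecting unknown values are assumptions added for the example.

```python
import re

SUPPORTED = {"DELAY", "SKIP", "RESTART", "IGNORE"}

# Match "OB=<word>" with an optional trailing ";". The lookbehind keeps
# the pattern from matching inside a longer key such as "JOB=...".
OB_PATTERN = re.compile(r"(?<![A-Za-z])OB=([A-Za-z]+);?")


def outage_behavior(archive_key1, archive_key2=""):
    """Return the behavior named in either archive key, or None."""
    for key in (archive_key1, archive_key2):
        match = OB_PATTERN.search(key or "")
        if match:
            value = match.group(1).upper()
            if value not in SUPPORTED:
                raise ValueError(f"unsupported outage behavior: {value}")
            return value
    return None


print(outage_behavior("OB=DELAY;"))
print(outage_behavior("", "OB=skip"))
```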


Weaknesses & Improvements

• Tough to catch connections for short-running jobs
• Takes days or weeks to build full dependency data for new job
• Overhead for prefix script on every job
• No input validation for manually edited items (OB in archive key, VARA.NEXT_OUTAGE settings)
• Dependencies apply only to jobs, would be nice to apply outage behavior to whole workflow