Supporting Operations Personnel: A Software Engineering Perspective

NICTA Copyright 2012 From imagination to impact

Supporting Operations Personnel: A Software Engineering Perspective

Len Bass


About NICTA

National ICT Australia

• Federal and state funded research company established in 2002

• Largest ICT research resource in Australia

• National impact is an important success metric

• ~700 staff/students working in 5 labs across major capital cities

• 7 university partners

• Providing R&D services, knowledge transfer to Australian (and global) ICT industry

NICTA technology is in over 1 billion mobile phones


Traditional View from Software Engineers

Traditionally, the software engineering community has viewed systems as being developed for users and existing in an environment. With this world view, the motivating question has been: how can development costs be reduced and run-time quality improved?

[Diagram labels: Application, Cloud Environment, End users, Developers]


A Broader View

Applications are affected not only by the behavior of end users but also by the actions of operators, who control the environment for a consumer’s application.

[Diagram labels: Application, Cloud Environment, Consumer, Operator, End users, Developers]


My Message: Consider the Operator in this Picture

Computer operations is a domain that impacts every application that operates in an enterprise environment. As such, software engineers need to be aware of how the actions of operators can affect their application and how the actions of their application can simplify life for operators.

[Diagram labels: Application, Cloud Environment, Consumer, Operator, End users, Developers]


Business Context

“Through 2015, 80% of outages impacting mission-critical services will be caused by people and process issues, and more than 50% of those outages will be caused by change/configuration/release integration and hand-off issues.”

Change/configuration/release integration and hand-off are all operations issues.

Gartner - http://www.rbiassets.com/getfile.ashx/42112626510

"I&O [infrastructure and operations] represents approximately 60 percent of total IT spending worldwide"

http://www.gartner.com/it/page.jsp?id=1807615



Outline

• Overview of operations domain
  – What do operators do?
  – What can go wrong with what they do?
• Some results NICTA has achieved or activities we have ongoing


What Do Operators Do?


Akamai’s NOC in Cambridge, Massachusetts

• Monitor and control data center/network/system activity
  – Install new/upgraded applications/middleware/configurations/hardware

• Support business continuity through back ups and disaster recovery


Monitor and Control

• Data Center
  – Total number and type of resources (may be virtual)
    • Processors
    • Storage
    • Network
• Network
  – Intrusion detection
  – Routing
  – Loading
• System
  – Allocation to resources
  – Install/uninstall
  – Configure


What Can Go Wrong with Monitor and Control?

Everything that was on the previous slide.
• Failure
  – Installations can fail
  – Resources fail and must be replaced
• Overload
  – Resources are over/under loaded and must be supplemented/removed
  – Networks get overloaded and routing must be changed
• Error
  – Routing may be incorrectly specified
  – Allocation of systems to resources may be incorrect
  – Configurations can be incorrectly specified


Install New/Upgraded Applications

• Specifying configuration for applications
• Synchronizing state for upgraded applications
• Testing new/upgraded applications in target environment
• Allocating resources for new version


What Can go Wrong with Installation?

• Again, it's everything.
  – Configuration can be misspecified
  – Cut over to new version may leave inconsistent state
  – Upgrade to level N of the stack may break software in level >N of the stack
  – Testing environment may not appropriately mirror real environment
  – Configuration of one level of the stack may be inconsistent with requirements of another level
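
The last failure mode, a configuration at one level that contradicts a requirement at another, can sometimes be caught before installation with a simple consistency check. The sketch below is illustrative only: the level names, the keys, and the `provides`/`requires` format are hypothetical and not taken from any particular product.

```python
# Hypothetical stack description: each level declares what it provides
# and what it requires from other levels. All names are illustrative.
stack = {
    "os":         {"provides": {"max_open_files": 1024}},
    "filesystem": {"provides": {"replication": 3}},
    "database":   {"requires": {"max_open_files": 32768, "replication": 3}},
}


def find_inconsistencies(stack):
    """Return (level, key, required, provided) tuples for unmet requirements."""
    # Collect everything any level provides.
    provided = {}
    for level in stack.values():
        provided.update(level.get("provides", {}))

    problems = []
    for name, level in stack.items():
        for key, required in level.get("requires", {}).items():
            actual = provided.get(key)
            # Numeric requirements are treated as minimums in this sketch.
            if actual is None or actual < required:
                problems.append((name, key, required, actual))
    return problems


for name, key, required, actual in find_inconsistencies(stack):
    print(f"{name}: needs {key} >= {required}, stack provides {actual}")
```

A check like this only covers requirements that someone has written down; dependencies that become visible only at execution time remain out of its reach.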



Supporting Business Continuity

• Disasters happen – natural or human causes
• Backing up data provides recovery possibility
  – Lag between last version backed up and when disaster happens
  – In the Cloud, backing up large amounts of data to different geographic regions takes time.


Hand Offs

• Problems can arise when a shift changes
  – What problems did old shift deal with?
  – What problems were totally solved?
  – What problems were partially solved?
  – What operations activities are currently ongoing?


Operations is a Target Rich Environment

• There are many existing tools. Operation of data centers would not work without tools.
• Much room for improvement (see Gartner quote)
• Some general approaches for improvement
  – Make software systems and operations tools process and incident aware, e.g. make them aware of an upgrade or a shift change
  – Model operations processes and systems using a single model.
    • Model analysis will provide opportunities for detecting trade-offs between human and automated activities.
    • Model might also enable smoother error detection


Outline

• Overview of operations domain
• Some results we have achieved or activities we have ongoing
  – Disaster Recovery product
  – Upgrade
  – Operator undo
  – Installation process


Disaster Recovery

• Clouds fail – Amazon had three outages in 2011 that affected whole availability zones or regions.

• NICTA has a subsidiary (Yuruware) with a non-intrusive disaster recovery product (Bolt).

• Bolt copies data periodically to a back up region.

• Bolt utilizes sophisticated data movement techniques to reduce time required to back up

• This is an insurance policy.


Next Problem – Upgrade

• Upgrades are a very common occurrence
• Upgrade frequency of some common systems:

Application            Average release interval
Facebook (platform)    < 7 days
Google Docs            < 50 days
MediaWiki              21 days
Joomla                 30 days

• Some systems have multiple releases per day, driven by developers – continuous deployment


Various Upgrade Strategies

• How many at once?
  – One at a time (rolling upgrade)
  – Groups at a time (staged upgrade, e.g. canaries; this uses the production environment for testing)
  – All at once (big flip)
• How long are new versions tested to determine correctness?
  – Period based – for some period of time
  – Load based – under some utilization assumptions
• What happens to old versions?
  – Replaced en masse
  – Maintained for some period for compatibility purposes
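
The "how many at once" choice can be pictured as a way of splitting the server list into batches. The following is a minimal sketch under that reading; the function name, the strategy strings, and the `group_size` parameter are made up for illustration, and the testing period between batches is left out.

```python
def upgrade_batches(servers, strategy, group_size=2):
    """Split a server list into the batches that are upgraded together.

    strategy: "rolling" (one at a time), "staged" (groups at a time),
    or "big_flip" (all at once).
    """
    if strategy == "rolling":
        size = 1
    elif strategy == "staged":
        size = group_size
    elif strategy == "big_flip":
        size = max(len(servers), 1)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [servers[i:i + size] for i in range(0, len(servers), size)]


servers = ["s1", "s2", "s3", "s4", "s5"]
for strategy in ("rolling", "staged", "big_flip"):
    print(strategy, upgrade_batches(servers, strategy))
```

With every strategy except the big flip, old and new versions are in service at the same time between batches.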



Having Multiple Versions Simultaneously Active May Lead to Mixed Version Race Condition

[Sequence diagram with numbered steps 1 to 5. Participants: Client (browser), Server 1 (old version), Server 2 (new version). Labeled events: Initial request; HTTP reply with embedded JavaScript; Start rolling upgrade; AJAX callback; ERROR.]


One Method for Preventing Mixed Version Race Condition is to Make Load Balancers Version Aware

[Diagram labels: Client may request particular version of a service; External facing Router (with respect to cloud); Internal Router (two); Server for Version A; Server for Version B (two)]

At each level of the routing hierarchy there are two possibilities for each request:
• Request is neutral with respect to version
• Request specifies version

Routing must:
• Be fast to ensure rapid response
• Satisfy “goodness” criteria for scheduling
• Conform to client request with respect to version

In addition:
• Servers are being upgraded to a later version while servicing client requests
• Load variation may trigger elasticity rules
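
The two request types can be sketched as a router that filters candidate servers by version before applying its scheduling rule. This is only an illustration: the class, its methods, and the round-robin rule are assumptions, not the design discussed in the talk, and a single router stands in for the whole hierarchy.

```python
import itertools


class VersionAwareRouter:
    """Round-robin router that honors an optional version in each request."""

    def __init__(self, servers):
        # servers: dict mapping server name -> version string
        self.servers = dict(servers)
        self._counter = itertools.count()

    def route(self, requested_version=None):
        # Version-neutral requests may go to any server; version-specific
        # requests may only go to servers running that version.
        candidates = sorted(
            name for name, version in self.servers.items()
            if requested_version is None or version == requested_version
        )
        if not candidates:
            raise LookupError(f"no server for version {requested_version!r}")
        return candidates[next(self._counter) % len(candidates)]

    def upgrade(self, name, new_version):
        # Servers change version while the router keeps serving requests.
        self.servers[name] = new_version


router = VersionAwareRouter({"s1": "A", "s2": "A", "s3": "B"})
print(router.route())        # version-neutral request
print(router.route("B"))     # must reach a version B server
router.upgrade("s1", "B")
print(router.route("A"))     # only s2 still runs version A
```

Filtering first and scheduling second keeps the version constraint hard, but it also shrinks the pool the scheduler can choose from, which is where the tension with the "goodness" criteria comes from.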


What is the Criterion for Measuring Load Balancer Scheduling?

• What is “goodness” with respect to routing decisions within the constraints of scheduling strategy and version awareness?
  – Uniform distribution of requests?
  – Keeping utilization within bounds?
  – Utilizing wide variety of clients?
  – Other?
• Main result so far: version awareness is incompatible with any of the above “goodness” criteria for the staged upgrade strategy.


Canary or Staged Strategy

• Upgrade one or several servers to new version and leave them for some time.

• Formulation:
  – Staged upgrade
    • M version A servers (constant number)
    • N version B servers (constant number)
    • Fixed number of clients
  – Version aware
    • Once a client has had a request serviced by a version B server it cannot subsequently have any requests serviced by any version A server.


Bifurcation of Clients

• Clients are bifurcated into version A clients and version B clients after some time
  – Intuitively, each client is either serviced by a server with version B, and consequently never again served by any server with version A, or it is never served by a server with version B. So each client ends up in the Server A class or the Server B class, but not both.

• We call clients that end up being serviced by servers with version A class A clients, and similarly for class B clients.

• Allowing additional clients does not fundamentally change result.
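
The bifurcation can be explored with a small simulation of the staged, version-aware formulation. This is a sketch under stated assumptions: requests arrive from uniformly random clients, and a client that has not yet seen version B is routed uniformly over all servers. The function name and parameters are illustrative.

```python
import random


def simulate(num_clients, a_servers, b_servers, num_requests, seed=0):
    """Route random requests under the version-aware rule; return client classes."""
    rng = random.Random(seed)
    seen_b = set()                      # clients already served by version B
    all_servers = a_servers + b_servers
    for _ in range(num_requests):
        client = rng.randrange(num_clients)
        if client in seen_b:
            # Version awareness: this client may only use version B now.
            server = rng.choice(b_servers)
        else:
            server = rng.choice(all_servers)
        if server in b_servers:
            seen_b.add(client)
    class_a = set(range(num_clients)) - seen_b
    return class_a, seen_b


class_a, class_b = simulate(
    num_clients=20, a_servers=["a1", "a2", "a3"], b_servers=["b1"],
    num_requests=2000,
)
print(len(class_a), "clients still on version A;", len(class_b), "on version B")
```

Under this particular routing policy a client can move from class A to class B but never back, so class A can only shrink as requests accumulate; a policy that pins unassigned clients to version A servers would behave differently.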



Bifurcation of Clients Implies

• Cannot control for utilization unless new instances of version B are created in response to demand
  – There are a fixed number of clients sending requests to a fixed number of servers with version B. The number of servers cannot be varied to reflect the load generated by the fixed set of clients. Consequently the utilization of servers with version B cannot be controlled.

• Cannot control for uniform distribution.
  – Uniform distribution means that every request has an equal chance of being sent to any server. If a client is in class A, then it has 0% chance of being sent to a server with version B.

• Difficult to control for wide variety of clients.
  – Variations among the clients must be mirrored within class A and class B clients since the classes are fixed after the bifurcation. This is difficult to accomplish since the types of variations that are important are usually not known.


Questions to Answer.

– How long does it take to reach bifurcated state under what assumptions?

– How can the goals of staged upgrade be achieved within the constraints of version awareness?



Next Problem

• Operators use scripts to perform actions such as update

• Scripts may fail
  – May be result of API failure (more on this later)
  – May be desire to set up testing environment
  – May be result of failure of underlying virtual machine.

• When a script fails, the operator may wish to return to a known state (undo several operations)


Operator Undo

• Not always that straightforward:
  – Attaching a volume is no problem while the instance is running; detaching might be problematic
  – Creating / changing auto-scaling rules has an effect on the number of running instances
    • Cannot terminate additional instances, as the rule would create new ones!
  – Deleted / terminated / released resources are gone!
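
Because individual operations cannot always be inverted one by one, undo can instead be framed as getting from the current resource state back to a recorded one. The sketch below is a deliberately naive version of that idea: it diffs two state snapshots and lists the actions needed. The snapshot format, action names, and resource names are hypothetical, and the ordering problems noted above (such as auto-scaling rules recreating instances) are ignored.

```python
def plan_undo(current, checkpoint):
    """Return a list of (action, resource, properties) steps that turn
    the current resource state back into the checkpointed state."""
    plan = []
    # Resources created since the checkpoint must be removed.
    for name in sorted(current.keys() - checkpoint.keys()):
        plan.append(("delete", name, None))
    # Resources removed since the checkpoint must be recreated; this only
    # works if deletion was deferred (a pseudo-delete) or can be redone.
    for name in sorted(checkpoint.keys() - current.keys()):
        plan.append(("create", name, checkpoint[name]))
    # Resources that still exist but have changed must be reconfigured.
    for name in sorted(current.keys() & checkpoint.keys()):
        if current[name] != checkpoint[name]:
            plan.append(("update", name, checkpoint[name]))
    return plan


checkpoint = {
    "vm-1": {"state": "running"},
    "vol-1": {"attached_to": "vm-1"},
}
current = {
    "vm-1": {"state": "stopped"},
    "vm-2": {"state": "running"},
}
for step in plan_undo(current, checkpoint):
    print(step)
```

A realistic planner would also have to order the steps so that dependencies between resources are respected.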



Undo for System Operators

[Diagram labels: Administrator; begin-transaction; do, do, do; rollback; + commit; + pseudo-delete]

Approach

[Diagram, built up over three slides. Labels: Administrator; do, do, do; Undo System; Sense cloud resources states; Initial state; Goal state; Plan; Generate code; Execute; Set of actions]


What about API Failures?

• Operator scripts make heavy use of checking or controlling state of resources
  – Start/stop VM
  – Is VM active?

• These scripts become calls to the cloud provider’s API.

• Calls may fail
  – Underlying VM has failed
  – Eventual consistency
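
One common defensive pattern for scripts that face such failures is to retry a call and poll until the expected state is observed, giving up after a bounded number of attempts. The sketch below assumes caller-supplied `call` and `is_done` functions; it does not use any real cloud provider API, and the names and defaults are illustrative.

```python
import time


def call_with_retry(call, is_done, attempts=5, delay=1.0, sleep=time.sleep):
    """Invoke call() and poll until is_done(result) holds or attempts run out.

    call and is_done are caller-supplied callables; nothing here is tied
    to a specific cloud provider's API.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            result = call()
            if is_done(result):
                return result
        except Exception as error:   # a real script would catch narrower types
            last_error = error
        if attempt < attempts - 1:
            sleep(delay * (2 ** attempt))  # exponential backoff between tries
    raise TimeoutError(f"gave up after {attempts} attempts") from last_error


# Usage with a stand-in for an eventually consistent "describe VM" call.
responses = iter(["pending", "pending", "running"])
state = call_with_retry(lambda: next(responses),
                        is_done=lambda s: s == "running",
                        sleep=lambda seconds: None)
print(state)
```

Bounding the attempts matters: a call that never completes should surface as an explicit error the operator can act on, not as a script that hangs.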



We Have Performed an Empirical Study of API Failures in EC2

• 922 cases out of 1109 reported API-related cases in the EC2 forum from 2010 to 2012 are API failures (rather than feature requests or general inquiries).

• We classified the extracted API failures into four types of failures:
  – content failures,
  – late timing failures,
  – halt failures, and
  – erratic failures.


Results

• A majority (60%) of the cases of API failure are related to stuck API calls or unresponsive API calls.

• A further 12% of the cases are about slow-responding API calls.

• 19% of the cases are related to output issues of API calls, including failed calls with unclear error messages, as well as missing, wrong, and unexpected output.

• 9% of the cases reported calls that were pending for a certain time and then returned to the original state without properly informing the caller, or calls that were first reported as successful but failed later.


Next Problem - Operations Processes

• We are looking at the process of installing new software
  – Error prone
  – Potential process improvements


Motivating Scenario

• You change the operating environment for an application
  – Configuration change
  – Version change
  – Hardware change

• Result is degraded performance

• When the software stack is deep with portions from different suppliers, the result is frequently:


Why is Installation Error Prone?

• Installation is complicated.
  – Installation guides for SAS 9.3 Intelligence, IBM i, Oracle 11g for Linux are ~250 pages each
  – Apache description of addresses and ports (one out of 16 descriptions) has the following elements:
    • Choosing and specifying ports for the server to listen to
    • IPv4 and IPv6
    • Protocols
    • Virtual Hosts
  – The number of configuration options that must be set can be large
    • Hadoop has 206 options
    • HBase has 64
  – Many dependencies are not visible until execution


Installation Processes

• Processes may be
  – Undocumented
  – Out of date
  – Insufficiently detailed

• Our goal is to build a process model including error recovery mechanisms


Our Activities

• Create up-to-date process models for installation processes. Information sources are
  – Process discovery from logs
  – Process formalization from existing written descriptions
• Process descriptions can be used to
  – Make trade-offs
  – Make recommendations in real time to operations staff
  – Recommend setting checkpoints for potential later undo, before a risky part of a process is entered
  – Assist in the detection of errors
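
As a toy illustration of the checkpoint recommendation, suppose a process model is just an ordered list of steps, each flagged as risky or not. The representation, the step names, and the source of the `risky` flag are all assumptions made for this sketch.

```python
def recommend_checkpoints(steps):
    """Return (position, name) for each step that begins a risky part.

    steps: ordered list of (name, risky) pairs. The risky flag is assumed
    to come from whoever builds the model (documentation or mined logs).
    """
    recommendations = []
    previous_risky = False
    for position, (name, risky) in enumerate(steps, start=1):
        # Recommend a checkpoint only when a risky part is entered,
        # not before every step inside it.
        if risky and not previous_risky:
            recommendations.append((position, name))
        previous_risky = risky
    return recommendations


install_process = [
    ("download package", False),
    ("stop old version", True),
    ("migrate data", True),
    ("start new version", False),
]
for position, name in recommend_checkpoints(install_process):
    print(f"set a checkpoint before step {position}: {name}")
```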


Hard Problems

• Creating accurate process models
  – Exception handling mechanisms are not well documented
  – Labor intensive
  – Our approach
    • Top down modeling using process modeling formalism
    • Bottom up process mining from error logs
• Diagnosing errors


Why is Error Diagnosis Hard?

In a distributed computing environment, when an error occurs during operations, it is difficult and time consuming to diagnose it. Diagnosis involves correlating messages from
• different distributed servers
• different portions of the software stack
and determining the root cause of the error. The root cause, in turn, may be within a portion of the stack that is different from where the error is observed.
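
A first step in correlating messages is often to merge the separate logs into one timeline. The sketch below assumes each log line starts with an ISO 8601 timestamp, that each log is already in time order, and that server clocks are synchronized; the log contents and source names are made up for illustration.

```python
import heapq
from datetime import datetime


def parse(line, source):
    """Parse 'ISO-timestamp message' into (timestamp, source, message)."""
    stamp, message = line.split(" ", 1)
    return datetime.fromisoformat(stamp), source, message


def merged_timeline(logs):
    """Merge per-source logs (each already time-ordered) into one timeline."""
    streams = [
        [parse(line, source) for line in lines]
        for source, lines in logs.items()
    ]
    return list(heapq.merge(*streams))


logs = {
    "hdfs@node1":  ["2012-05-01T10:00:01 block replication delayed"],
    "hbase@node2": ["2012-05-01T10:00:00 region server starting",
                    "2012-05-01T10:00:03 cannot open region"],
}
for stamp, source, message in merged_timeline(logs):
    print(stamp.isoformat(), source, message)
```

A merged timeline only shows ordering; deciding which earlier message in another layer actually caused the observed error is the hard part.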


Test Bed

Our current test bed is the HBase stack


Currently Performing Analysis of Configuration Errors

• Cross stack errors may take hours to diagnose
  – Log files are inconsistent
  – Error message may not give context necessary to determine root cause.


Where to Find Information about Operations Domain?

• Every open source program requires a variety of configuration parameters.

• Every modern application depends on a variety of middleware so cross domain examples should be readily available.

• Most organizations have extensive processes for their operations personnel. Use these processes as a framework for investigating process/product interactions.



Summary

• Operations problems will account for the majority of outages and IT costs in the next several years.

• The operations space is a rich source of research problems that has been insufficiently mined.

• Best way to determine what problems to attack is to monitor or interview operators



NICTA Team

• Anna Liu
• Alan Fekete
• Min Fu
• Jim Zhanwen Li
• Qinghua Lu
• Hiroshi Wada
• Ingo Weber
• Xiwei Xu
• Liming Zhu
