A survey of the operations domain and what can go wrong with operations activities.
NICTA Copyright 2012 From imagination to impact
Supporting Operations Personnel: A Software Engineering Perspective
Len Bass
About NICTA
National ICT Australia
• Federal and state funded research company established in 2002
• Largest ICT research resource in Australia
• National impact is an important success metric
• ~700 staff/students working in 5 labs across major capital cities
• 7 university partners
• Providing R&D services, knowledge transfer to Australian (and global) ICT industry
NICTA technology is in over 1 billion mobile phones
Traditional View from Software Engineers
[Diagram labels: Application, Cloud Environment, End users, Developers]
Traditionally, the software engineering community has viewed systems as being developed for users and existing in an environment. Under this world view, the motivating question has been: how can development costs be reduced and run-time quality improved?
A Broader View
[Diagram labels: Application, Cloud Environment, Consumer, Operator, End users, Developers]
Applications are not only affected by the behavior of the end users but also by actions of operators who control the environment for a consumer’s application.
My Message: Consider the Operator in this Picture
[Diagram labels: Application, Cloud Environment, Consumer, Operator, End users, Developers]
Computer operations is a domain that impacts every application that operates in an enterprise environment. As such, software engineers need to be aware of how actions of operators can affect their application and how actions of their application can simplify life for operators.
Business Context
“Through 2015, 80% of outages impacting mission-critical services will be caused by people and process issues, and more than 50% of those outages will be caused by change/configuration/release integration and hand-off issues.”
Change/configuration/release integration and hand off are all operations issues.
Gartner - http://www.rbiassets.com/getfile.ashx/42112626510
"I&O [Infrastructure and operations] represents approximately 60 percent of total IT spending worldwide, "
http://www.gartner.com/it/page.jsp?id=1807615
Outline
• Overview of operations domain
  – What do operators do?
  – What can go wrong with what they do?
• Some results NICTA has achieved or activities we have ongoing
What Do Operators Do?
Akamai’s NOC in Cambridge, Massachusetts
• Monitor and control data center/network/system activity
  – Install new/upgraded applications/middleware/configurations/hardware
• Support business continuity through back ups and disaster recovery
Monitor and Control
• Data Center
  – Total number and type of resources (may be virtual)
    • Processors
    • Storage
    • Network
• Network
  – Intrusion detection
  – Routing
  – Loading
• System
  – Allocation to resources
  – Install/uninstall
  – Configure
What Can Go Wrong with Monitor and Control?
Everything that was on the previous slide.
• Failure
  – Installations can fail
  – Resources fail and must be replaced
• Overload
  – Resources are over/under loaded and must be supplemented/removed
  – Networks get overloaded and routing must be changed
• Error
  – Routing may be incorrectly specified
  – Allocation of systems to resources may be incorrect
  – Configurations can be incorrectly specified
Install New/Upgraded Applications
• Specifying configuration for applications
• Synchronizing state for upgraded applications
• Testing new/upgraded applications in target environment
• Allocating resources for new version
What Can Go Wrong with Installation?
• Again, it's everything.
  – Configuration can be misspecified
  – Cut over to new version may leave inconsistent state
  – Upgrade to level N of the stack may break software in levels above N of the stack
  – Testing environment may not appropriately mirror real environment
  – Configuration of one level of the stack may be inconsistent with requirements of another level.
Supporting Business Continuity
• Disasters happen – natural or human causes
• Backing up data provides recovery possibility
  – Lag between last version backed up and when disaster happens
  – In the Cloud, backing up large amounts of data to different geographic regions takes time.
Hand Offs
• Problems can arise when a shift changes
  – What problems did old shift deal with?
  – What problems were totally solved?
  – What problems were partially solved?
  – What operations activities are currently ongoing?
Operations is a Target Rich Environment
• There are many existing tools. Operation of data centers would not work without tools
• Much room for improvement (see Gartner quote)
• Some general approaches for improvement
  – Make software systems and operations tools process- and incident-aware, e.g. aware of an upgrade or a shift change
  – Model operations processes and systems using a single model.
    • Model analysis will provide opportunities for detecting trade offs between human and automated activities.
    • Model might also enable smoother error detection
Outline
• Overview of operations domain
• Some results we have achieved or activities we have ongoing
  – Disaster Recovery product
  – Upgrade
  – Operator undo
  – Installation process.
Disaster Recovery
• Clouds fail – Amazon had three outages in 2011 that affected whole availability zones or regions.
• NICTA has a subsidiary (Yuruware) with a non-intrusive disaster recovery product (Bolt).
• Bolt copies data periodically to a back up region.
• Bolt utilizes sophisticated data movement techniques to reduce time required to back up
• This is an insurance policy.
Next Problem – Upgrade
• Upgrades are a very common occurrence
• Upgrade frequency of some common systems:

Application          Average release interval
Facebook (platform)  < 7 days
Google Docs          < 50 days
MediaWiki            21 days
Joomla               30 days

• Some systems have multiple releases per day, driven by developers – continuous deployment
Various Upgrade Strategies
• How many at once?
  – One at a time (rolling upgrade)
  – Groups at a time (staged upgrade, e.g. canaries; this uses the production environment for testing)
  – All at once (big flip)
• How long are new versions tested to determine correctness?
  – Period based – for some period of time
  – Load based – under some utilization assumptions
• What happens to old versions?
  – Replaced en masse
  – Maintained for some period for compatibility purposes
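
The "how many at once" choice can be sketched as a batching function. This is only an illustration: the function name, the strategy strings, and the `group_size` parameter are assumptions of the sketch, not part of any tool mentioned in the talk.

```python
from typing import Iterator, List


def upgrade_batches(servers: List[str], strategy: str,
                    group_size: int = 2) -> Iterator[List[str]]:
    """Yield the groups of servers that are upgraded together."""
    if strategy == "rolling":        # one at a time
        size = 1
    elif strategy == "staged":       # groups at a time, e.g. canaries first
        size = group_size
    elif strategy == "big_flip":     # all at once
        size = max(len(servers), 1)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    for start in range(0, len(servers), size):
        yield servers[start:start + size]


servers = ["s1", "s2", "s3", "s4", "s5"]
rolling = list(upgrade_batches(servers, "rolling"))
staged = list(upgrade_batches(servers, "staged", group_size=2))
big_flip = list(upgrade_batches(servers, "big_flip"))
```

The sketch only decides which servers change together; how long each batch is observed before the next one starts is the separate "how long are new versions tested" question.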
Having Multiple Versions Simultaneously Active May Lead to Mixed Version Race Condition
[Sequence diagram labels: Client (browser); Server 1 (old version); Server 2 (new version); Initial request; HTTP reply with embedded JavaScript; Start rolling upgrade; AJAX callback; ERROR]
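
One reading of the diagram can be sketched in a few lines, assuming the page served by the old version embeds JavaScript that calls an endpoint the new version no longer offers. The `Server` class and the endpoint names are hypothetical and exist only for this illustration.

```python
class Server:
    """A server that only understands the endpoints of its own version."""

    def __init__(self, version: str, endpoints: set):
        self.version = version
        self.endpoints = endpoints

    def handle(self, endpoint: str) -> str:
        if endpoint not in self.endpoints:
            return f"ERROR: {endpoint} unknown to version {self.version}"
        return f"OK from version {self.version}"


# Hypothetical endpoints: the upgrade renames the callback endpoint.
old_server = Server("old", {"/page", "/callback_v1"})
new_server = Server("new", {"/page", "/callback_v2"})

# The client loads the page from the old server; the embedded JavaScript
# it receives will later call the old callback endpoint.
first_reply = old_server.handle("/page")
callback_endpoint = "/callback_v1"

# The rolling upgrade starts, and a version-unaware load balancer sends
# the AJAX callback to the already-upgraded server.
callback_reply = new_server.handle(callback_endpoint)
```

Here `first_reply` succeeds while `callback_reply` is an error, even though neither server is faulty on its own; the failure comes from the mix of versions.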
One Method for Preventing Mixed Version Race Condition is to Make Load Balancers Version Aware
[Diagram labels: client (may request a particular version of a service); external-facing router (with respect to the cloud); two internal routers; servers for Version A and Version B]
At each level of the routing hierarchy there are two possibilities for each request
• Request is neutral with respect to version
• Request specifies version
Routing must
• Be fast to ensure rapid response
• Satisfy “goodness” criteria for scheduling
• Conform to client request with respect to version.
In addition:
• Servers are being upgraded to a later version while servicing client requests
• Load variation may trigger elasticity rules
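
A minimal sketch of one routing level that honours both kinds of request is given below. The class name, the round-robin policy, and the `upgrade` method are assumptions made for illustration; the slides do not prescribe a scheduling policy.

```python
import itertools
from typing import Dict, List, Optional


class VersionAwareRouter:
    """Round-robin router that honours an optional version constraint."""

    def __init__(self, servers: Dict[str, str]):
        # servers maps a server name to the version it currently runs.
        self.servers = servers
        self._counter = itertools.count()

    def route(self, requested_version: Optional[str] = None) -> str:
        # Version-neutral requests may go anywhere; others are filtered.
        candidates: List[str] = sorted(
            name for name, version in self.servers.items()
            if requested_version is None or version == requested_version
        )
        if not candidates:
            raise LookupError(f"no server runs version {requested_version}")
        return candidates[next(self._counter) % len(candidates)]

    def upgrade(self, name: str, new_version: str) -> None:
        # Servers change version while the router keeps serving requests.
        self.servers[name] = new_version


router = VersionAwareRouter({"s1": "A", "s2": "A", "s3": "B"})
neutral = [router.route() for _ in range(3)]
pinned = router.route("B")
router.upgrade("s1", "B")
after_upgrade = router.route("B")
```

Filtering the candidate set per request keeps the decision cheap, but it also shows the tension the slide points at: every version constraint shrinks the set the scheduler can balance over.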
What Is the Criterion for Measuring Load Balancer Scheduling?
• What is “goodness” with respect to routing decisions within the constraints of scheduling strategy and version awareness?
  – Uniform distribution of requests?
  – Keeping utilization within bounds?
  – Utilizing a wide variety of clients?
  – Other?
• Main result so far: version awareness is incompatible with any of the above “goodness” criteria for the staged upgrade strategy.
Canary or Staged Strategy
• Upgrade one or several servers to the new version and leave them for some time.
• Formulation:
  – Staged upgrade
    • M version A servers (constant number)
    • N version B servers (constant number)
    • Fixed number of clients
  – Version aware
    • Once a client has had a request serviced by a version B server, it cannot subsequently have any requests serviced by any version A server.
Bifurcation of Clients
• Clients are bifurcated into version A clients and version B clients after some time
  – Intuitively, each client is either serviced by a server with version B at some point, and consequently never again served by any server with version A, or it is never served by a server with version B. So each client ends up in the Server A class or the Server B class but not both.
• We call clients that end up being serviced only by servers with version A class A clients, and similarly for class B clients.
• Allowing additional clients does not fundamentally change the result.
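
The one-way nature of the version-aware rule can be illustrated with a small simulation. It assumes that version-neutral requests are routed uniformly at random over all servers, which is an assumption of this sketch rather than a policy stated in the slides; the function and parameter names are likewise hypothetical.

```python
import random


def simulate(num_clients: int, m_version_a: int, n_version_b: int,
             requests: int, seed: int = 0) -> dict:
    """Count clients that have been served by version B after `requests`."""
    rng = random.Random(seed)
    servers = ["A"] * m_version_a + ["B"] * n_version_b
    seen_b = [False] * num_clients          # has the client ever hit version B?
    for _ in range(requests):
        client = rng.randrange(num_clients)
        if seen_b[client]:
            version = "B"                   # version-aware rule: no return to A
        else:
            version = rng.choice(servers)   # assumed: uniform random routing
        if version == "B":
            seen_b[client] = True
    return {"class_B": sum(seen_b), "not_yet_B": num_clients - sum(seen_b)}


early = simulate(num_clients=100, m_version_a=8, n_version_b=2, requests=200)
late = simulate(num_clients=100, m_version_a=8, n_version_b=2, requests=20000)
```

Because both runs use the same seed, `late` continues the request stream of `early`, so its class B count can only be equal or larger: clients enter class B but never leave it.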
Bifurcation of Clients Implies
• Cannot control for utilization unless new instances of version B are created in response to demand
  – There are a fixed number of clients sending requests to a fixed number of servers with version B. The number of servers cannot be varied to reflect the load generated by the fixed set of clients. Consequently the utilization of servers with version B cannot be controlled.
• Cannot control for uniform distribution.
  – Uniform distribution means that every request has an equal chance of being sent to any server. If a client is in class A, then it has 0% chance of being sent to a server with Version B.
• Difficult to control for wide variety of clients.
  – Variations among the clients must be mirrored within class A and class B clients since the classes are fixed after the bifurcation. This is difficult to accomplish since the types of variations that are important are usually not known.
Questions to Answer.
– How long does it take to reach bifurcated state under what assumptions?
– How can the goals of staged upgrade be achieved within the constraints of version awareness?
Next Problem
• Operators use scripts to perform actions such as update
• Scripts may fail
  – May be result of API failure (more on this later)
  – May be desire to set up testing environment
  – May be result of failure of underlying virtual machine.
• When a script fails, the operator may wish to return to a known state (undo several operations)
Operator Undo
• Not always that straightforward:
  – Attaching volume is no problem while the instance is running, detaching might be problematic
  – Creating / changing auto-scaling rules has effect on number of running instances
    • Cannot terminate additional instances, as the rule would create new ones!
  – Deleted / terminated / released resources are gone!
Undo for System Operators
[Diagram labels: Administrator; begin-transaction, do, do, do, rollback; + commit; + pseudo-delete]
Approach
[Diagram labels: Administrator; Undo System; begin-transaction, do, do, do, rollback; Sense cloud resources states (shown twice)]
Approach
[Diagram labels: Administrator; Undo System; begin-transaction, do, do, do, rollback; Sense cloud resources states (shown twice); Goal state; Initial state]
Approach
[Diagram labels: Administrator; Undo System; begin-transaction, do, do, do, rollback; Sense cloud resources states (shown twice); Goal state; Initial state; Plan; Generate code; Execute; Set of actions]
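
The sense, plan, and execute labels suggest a pipeline that can be sketched as a state comparison. This sketch assumes that the state sensed at begin-transaction serves as the goal state of the rollback and the state sensed at rollback as the plan's initial state. It uses a plain dictionary diff in place of whatever planner the undo system actually uses, and all names and action labels are hypothetical.

```python
from typing import Dict, List, Tuple

State = Dict[str, str]   # resource id -> sensed status, e.g. "running"


def plan_undo(current: State, goal: State) -> List[Tuple[str, str]]:
    """Return (action, resource) pairs that move `current` towards `goal`."""
    actions: List[Tuple[str, str]] = []
    for resource in sorted(current.keys() - goal.keys()):
        actions.append(("delete", resource))       # created during the transaction
    for resource in sorted(goal.keys() - current.keys()):
        # Only possible if the deletion was a pseudo-delete.
        actions.append(("recreate", resource))
    for resource in sorted(goal.keys() & current.keys()):
        if current[resource] != goal[resource]:
            actions.append((f"set:{goal[resource]}", resource))
    return actions


# State sensed at begin-transaction (assumed to be the rollback's goal).
at_begin = {"vm-1": "running", "vol-1": "attached"}
# State sensed at rollback, after the administrator's do operations.
at_rollback = {"vm-1": "stopped", "vm-2": "running"}

plan = plan_undo(current=at_rollback, goal=at_begin)
```

The resulting set of actions would still have to be turned into provider calls and executed, and a plain diff ignores ordering constraints such as auto-scaling rules that recreate terminated instances.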
What about API Failures?
• Operator scripts make heavy use of checking or controlling state of resources
  – Start/stop VM
  – Is VM active?
• These scripts become calls to the cloud provider’s API.
• Calls may fail
  – Underlying VM has failed
  – Eventual consistency.
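
One way a script might cope with stale answers caused by eventual consistency is to poll until the expected state is reported or a deadline passes. The sketch below is an illustration only: `wait_for_state` and `fake_describe_vm` are hypothetical names, and the fake stands in for a real provider call.

```python
import time
from typing import Callable


def wait_for_state(get_state: Callable[[], str], expected: str,
                   timeout_s: float = 5.0, interval_s: float = 0.01) -> bool:
    """Poll until get_state() returns `expected` or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if get_state() == expected:
                return True
        except ConnectionError:
            pass                      # treat a failed call as "not yet known"
        time.sleep(interval_s)
    return False


# Stand-in for a provider call: reports stale state for the first two reads.
_reads = iter(["pending", "pending", "running"])


def fake_describe_vm() -> str:
    return next(_reads, "running")


became_running = wait_for_state(fake_describe_vm, "running")
```

A `False` result tells the script that the state is still unknown, which is different from knowing that the operation failed.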
We Have Performed an Empirical Study of API Failures in EC2
• 922 cases out of 1109 reported API-related cases in the EC2 forum from 2010 to 2012 are API failures (rather than feature requests or general inquiries).
• We classified the extracted API failures into four types of failures:
  – content failures,
  – late timing failures,
  – halt failures, and
  – erratic failures.
Results
• A majority (60%) of the cases of API failure are related to stuck or unresponsive API calls.
• A large portion (12%) of the cases are about slow-responding API calls.
• 19% of the cases are related to output issues of API calls, including failed calls with unclear error messages, as well as missing output, wrong output, and unexpected output of API calls.
• 9% of the cases reported calls that were pending for a certain time and then returned to the original state without properly informing the caller, or calls that were first reported as successful but failed later.
Next Problem - Operations Processes
• We are looking at the process of installing new software
  – Error prone
  – Potential process improvements.
Motivating Scenario
• You change the operating environment for an application
  – Configuration change
  – Version change
  – Hardware change
• Result is degraded performance
• When the software stack is deep with portions from different suppliers, the result is frequently:
Why is Installation Error Prone?
• Installation is complicated.
  – Installation guides for SAS 9.3 Intelligence, IBM i, Oracle 11g for Linux are ~250 pages each
  – Apache description of addresses and ports (one out of 16 descriptions) has the following elements:
    • Choosing and specifying ports for the server to listen to
    • IPv4 and IPv6
    • Protocols
    • Virtual Hosts
  – The number of configuration options that must be set can be large
    • Hadoop has 206 options
    • HBase has 64
  – Many dependencies are not visible until execution
Installation Processes
• Processes may be
  – Undocumented
  – Out of date
  – Insufficiently detailed
• Our goal is to build a process model that includes error recovery mechanisms
Our Activities
• Create up to date process models for installation processes. Information sources are
  – Process discovery from logs
  – Process formalization from existing written descriptions.
• Process descriptions can be used to
  – Make trade offs
  – Make recommendations in real time to operations staff
  – Recommend setting checkpoints for potential later undo, before a risky part of a process is entered
  – Assist in the detection of errors
Hard Problems
• Creating accurate process models
  – Exception handling mechanisms are not well documented
  – Labor intensive.
  – Our approach
    • Top down modeling using process modeling formalism
    • Bottom up process mining from error logs
• Diagnosing errors
Why is Error Diagnosis Hard?
In a distributed computing environment, when an error occurs during operations, it is difficult and time consuming to diagnose it.
Diagnosis involves correlating messages from
• different distributed servers
• different portions of the software stack
and determining the root cause of the error.
The root cause, in turn, may be within a portion of the stack that is different from where the error is observed.
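
The correlation step can be sketched as merging per-source logs into one timeline. The log tuples below are made-up data with generic source names, and treating the earliest error line as a root-cause candidate is only a heuristic assumed for this sketch; it also presumes the sources have comparable clocks.

```python
import heapq
from typing import Iterable, List, Tuple

LogLine = Tuple[float, str, str]   # (timestamp, source, message)


def merge_logs(*logs: Iterable[LogLine]) -> List[LogLine]:
    """Merge per-source logs (each already time-ordered) into one timeline."""
    return list(heapq.merge(*logs, key=lambda line: line[0]))


def first_error(timeline: List[LogLine]) -> LogLine:
    """Return the earliest line whose message mentions an error."""
    for line in timeline:
        if "error" in line[2].lower():
            return line
    raise LookupError("no error line found")


# Made-up log lines from two layers of a stack.
app_log = [(10.0, "app", "request received"),
           (12.5, "app", "ERROR: write failed")]
storage_log = [(11.0, "storage", "ERROR: disk quota exceeded"),
               (13.0, "storage", "retrying")]

timeline = merge_logs(app_log, storage_log)
candidate = first_error(timeline)
```

In this example the error is observed in the application layer at 12.5, while the candidate found on the merged timeline sits in the storage layer at 11.0, which matches the point that the root cause may lie in a different portion of the stack.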
Test Bed
Our current test bed is the HBase stack
Currently Performing Analysis of Configuration Errors
• Cross stack errors may take hours to diagnose
  – Log files are inconsistent
  – Error message may not give context necessary to determine root cause.
Where to Find Information about Operations Domain?
• Every open source program requires a variety of configuration parameters.
• Every modern application depends on a variety of middleware so cross domain examples should be readily available.
• Most organizations have extensive processes for their operations personnel. Use these processes as a framework for investigating process/product interactions.
Summary
• Operations problems will account for the majority of outages and IT costs in the next several years.
• The operations space is a rich source of research problems that has been insufficiently mined.
• The best way to determine what problems to attack is to monitor or interview operators.
NICTA Team
• Anna Liu
• Alan Fekete
• Min Fu
• Jim Zhanwen Li
• Qinghua Lu
• Hiroshi Wada
• Ingo Weber
• Xiwei Xu
• Liming Zhu