Availability & Related Concepts
• High Availability
• Calculation of Availability
• Components of Availability
• Characteristics
• Rules
• Seven ‘R’s
Dr. Neelu J. Ahuja, College of Engineering Studies

Knowing Availability…
• Availability: the percentage of total time that a network, system, or service is available for use.
What is HA (High Availability)?
• A network, system, or service with specific design elements intended to keep availability above a high threshold (e.g., 99.99%).

High Availability…
• It is a measure of the probability that a service is available for use at any given instant.
• A system is considered “highly available” if it has an uptime of five nines (99.999%).
• Availability is basically a function of:
  • System Reliability
  • System Reparability
  • System Redundancy

RR to achieve HA…
• Rapid Recovery (RR) systems are one way to achieve High Availability.
• A Rapid Recovery system is a network, system, or service with specific design elements intended to recover from downtime very quickly.
• The delay is kept as small as possible so that the user is not inconvenienced.
• The recovery time may vary depending on the kind of application.

What’s System Reliability?
• It is a measure of continuous system uptime in the absence of IT failure.
• A system is said to be “highly reliable” if it has a high mean time between failures (MTBF).
• MTBF is also called MTBSIF: mean time between service-impacting failures.
• MTBSIF is a term taken from the telecom industry.

What’s System Reparability?
• It is a measure of how quickly a failed device or system can be restored to service.
• Reparability is measured as mean time to repair (MTTR).
• The less reliable the system, the greater the need for a low MTTR to support overall system availability.

What’s System Redundancy?
• Redundancy augments the reparability of individual components by establishing a backup or standby.
• This means that there are multiple resources providing the same service.
• The effectiveness of redundancy is a function of how quickly a backup component can be brought into service.

Availability
Availability = MTBF / (MTBF + MTTR)
High Availability = High MTBF or Low MTTR
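
The formula above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name and the sample MTBF and MTTR figures are made up, chosen only to show that a lower MTTR raises availability for the same MTBF.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return availability as a fraction between 0 and 1: MTBF / (MTBF + MTTR)."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative (made-up) figures: the same MTBF with two different repair times.
slow_repair = availability(mtbf_hours=999.0, mttr_hours=1.0)  # one hour to repair
fast_repair = availability(mtbf_hours=999.0, mttr_hours=0.1)  # six minutes to repair

print(f"MTTR 1.0 h: {slow_repair:.4%}")
print(f"MTTR 0.1 h: {fast_repair:.4%}")
```

With the MTBF held fixed, cutting the repair time is what moves the result closer to 100%.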

Calculating Availability
• Availability can be measured directly through periodic polling.
• Polling is a communications control method used by some computer/terminal systems whereby a "master" station asks many devices attached to a common transmission medium, in turn, whether they have information to send.
• Common tools used for polling include SNMP and Nagios.
• SNMP = Simple Network Management Protocol.
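
As a rough sketch of measuring availability by polling, the Python below counts the fraction of polls that a target answered. The function names and the `check` callable are hypothetical; a real poller would query the device (for example over SNMP) and wait a fixed interval between polls, which is omitted here.

```python
from typing import Callable, Iterable, List


def measured_availability(poll_results: Iterable[bool]) -> float:
    """Fraction of polls in which the target answered."""
    results = list(poll_results)
    if not results:
        raise ValueError("at least one poll result is required")
    return sum(results) / len(results)


def poll(check: Callable[[], bool], times: int) -> List[bool]:
    """Call check() `times` times; an exception counts as 'down'."""
    outcomes = []
    for _ in range(times):
        try:
            outcomes.append(bool(check()))
        except Exception:
            outcomes.append(False)
    return outcomes


# A scripted stand-in for a real device: it answers three polls out of four.
scripted = iter([True, True, False, True])
samples = poll(lambda: next(scripted), times=4)
print(measured_availability(samples))  # 0.75
```

The accuracy of the measurement depends on the polling interval: an outage shorter than the interval can be missed entirely.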

Using SNMP…
• Simple Network Management Protocol (SNMP) is an application-layer protocol that facilitates the exchange of management information between network devices.
• It is part of the Transmission Control Protocol/Internet Protocol (TCP/IP) protocol suite.
• SNMP enables network administrators to manage network performance, find and solve network problems, and plan for network growth.

Nagios…
• Nagios is a host and service monitor designed to inform administrators of network problems before clients, end users, or managers encounter them.
• It was initially designed to run under the Linux operating system, but works fine under most *NIX variants as well.
• The monitoring daemon runs intermittent checks on hosts and services, which return status information to Nagios.
• When problems are encountered, the daemon can send notifications to administrative contacts in a variety of ways (email, instant message, SMS, etc.).
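
The checks described above are usually small programs that report a status back to the daemon. The sketch below follows the Nagios plugin convention of exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL), and 3 (UNKNOWN). The function names, the TCP probe, and the one-second warning threshold are assumptions made for illustration, not taken from the slides.

```python
import socket
import time
from typing import Optional, Tuple

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def probe_tcp(host: str, port: int, timeout: float = 5.0) -> Optional[float]:
    """Return the connect time in seconds, or None if the connection failed."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


def classify(elapsed: Optional[float], warn_seconds: float = 1.0) -> Tuple[int, str]:
    """Map a probe result to an (exit code, status line) pair."""
    if elapsed is None:
        return CRITICAL, "CRITICAL - no response"
    if elapsed > warn_seconds:
        return WARNING, f"WARNING - slow response ({elapsed:.3f}s)"
    return OK, f"OK - responded in {elapsed:.3f}s"
```

A plugin built on this would call `classify(probe_tcp(host, port))`, print the status line, and exit with the returned code so that the daemon can record the result and decide whether to notify.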

Calculating Availability…
• Tools such as SNMP and Nagios are used to measure the availability of the system in question.
• A formula for predicting the availability of a single component is:
  Availability = MTBF / (MTBF + MTTR), or equivalently 1 - MTTR / (MTBF + MTTR)
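
The two forms of the single-component formula give the same value. The short derivation below shows why, reading the slide's TTR as the mean time to repair (MTTR); the numbers in the second line are made up for illustration.

```latex
\[
1 - \frac{\mathrm{MTTR}}{\mathrm{MTBF} + \mathrm{MTTR}}
  = \frac{(\mathrm{MTBF} + \mathrm{MTTR}) - \mathrm{MTTR}}{\mathrm{MTBF} + \mathrm{MTTR}}
  = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}}
\]
\[
\text{Example: } \mathrm{MTBF} = 500\ \text{h},\ \mathrm{MTTR} = 2\ \text{h}
  \;\Rightarrow\; \frac{500}{502} \approx 0.9960
\]
```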

Components of Availability…
• Data Centre Facility
• Server Hardware (Processor, Memory, Communication channels)
• Server System Software (OS, program products such as SDKs, WebSphere software, SOA packs)
• Application Software (Programs, DBMS, etc.)
• Disk Hardware (Data files, Control files)
• Database Software (Data files, Data dictionary files)
• Network Software (Protocols, Network component drivers)
• Network Hardware (Controllers, Hubs, Routers, Repeaters, Modems, etc.)
• Desktop Software (Office suite, General-purpose applications)
• Desktop Hardware (Processors, Memory, Disk, Interface cards)

For Good Availability…

Characteristic                                           Priority
Knowledge of Systems Software & Components               High
Knowledge of Network Software & Components               High
Knowledge of Database Systems                            High
Knowledge of Power & Air Conditioning Systems            High
Ability to think & act tactically                        High
Knowledge of Software Configuration                      Medium
Knowledge of Hardware Configuration                      Medium
Knowledge of Backup System                               Medium
Knowledge of Desktop Hardware & Software                 Medium
Knowledge of Applications                                Low
Ability to work effectively with Developers              Low
Ability to communicate effectively with IT Executives    Low
Ability to manage diversity                              Low
Ability to think & plan strategically                    Low

Rules of Nine Availability…

No. of   % of           Weekly       Weekly         Missed events   Missed events
Nines    Availability   Hours Down   Minutes Down   out of 10,000   out of 1,000,000
1        90.000         10.000       600.00         1000.0          100,000
2        99.000          1.000        60.00          100.0           10,000
3        99.900          0.100         6.00           10.0            1,000
4        99.990          0.010         0.60            1.0              100
5        99.999          0.001         0.06            0.1               10
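
The rule-of-nines figures can be reproduced with a short calculation. The sketch below is illustrative and its function names are made up. One inference to flag: the table's weekly figures (10 hours down at 90%) match a service week of 100 hours, which the slides do not state; a full 168-hour week would give 16.8 hours at 90%. The service hours are therefore left as a parameter.

```python
def availability_percent(nines: int) -> float:
    """1 nine -> 90.0, 2 nines -> 99.0, 3 nines -> 99.9, and so on."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    return 100.0 * (1.0 - 10.0 ** -nines)


def weekly_minutes_down(nines: int, service_hours_per_week: float = 168.0) -> float:
    """Allowed downtime per week, in minutes, for the given number of nines."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailable_fraction = 10.0 ** -nines
    return service_hours_per_week * 60.0 * unavailable_fraction


for n in range(1, 6):
    print(
        n,
        f"{availability_percent(n):.3f}%",
        f"{weekly_minutes_down(n, 100.0):.2f} min per 100 h week",
        f"{weekly_minutes_down(n):.2f} min per 168 h week",
    )
```

Each extra nine divides the permitted downtime by ten, which is why the step from four to five nines is so much harder to achieve than the step from one to two.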

Business Continuity…
• During unavailability, the primary goal is to keep the business going. This is referred to as Business Continuity.
• Business Continuity is the preparation for, response to, and recovery from an application outage that adversely affects business operations.
• Business Continuity solutions address systems unavailability, degraded application performance, or unacceptable recovery strategies.

Why Assuring Availability…
Know the downtime costs (per hour, day, two days...)
• Lost Productivity
  • Number of employees impacted (x hours out * hourly rate)
• Lost Revenue
  • Direct loss
  • Compensatory payments
  • Lost future revenue
  • Billing losses
  • Investment losses
• Damaged Reputation
  • Customers
  • Suppliers
  • Financial markets
  • Banks
  • Business partners
• Financial Performance
  • Revenue recognition
  • Cash flow
  • Lost discounts (A/P)
  • Payment guarantees
  • Credit rating
  • Stock price
• Other Expenses
  • Temporary employees, equipment rental, overtime costs, extra shipping costs, travel expenses...
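
The lost-productivity term (employees impacted x hours out x hourly rate) and the per-hour, per-day, and two-day view of downtime cost can be sketched as follows. All names and figures are hypothetical. The sketch covers only the directly quantifiable terms; damaged reputation and financial-performance effects are not modelled.

```python
def lost_productivity(employees_impacted: int, hours_out: float, hourly_rate: float) -> float:
    """Employees impacted x hours out x hourly rate."""
    return employees_impacted * hours_out * hourly_rate


def downtime_cost(
    hours_out: float,
    employees_impacted: int,
    hourly_rate: float,
    lost_revenue_per_hour: float = 0.0,
    other_expenses: float = 0.0,
) -> float:
    """Sum of lost productivity, lost revenue, and other one-off expenses."""
    productivity = lost_productivity(employees_impacted, hours_out, hourly_rate)
    revenue = lost_revenue_per_hour * hours_out
    return productivity + revenue + other_expenses


# Made-up inputs: 50 employees at 40 per hour, 10,000 of revenue lost per hour.
for hours in (1, 24, 48):  # per hour, per day, two days
    print(hours, downtime_cost(hours, 50, 40.0, lost_revenue_per_hour=10_000.0))
```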

Statistics to show loss due to unavailability… e.g., loss in millions of US dollars

Industry / application               Loss (US$ millions)
Retail brokerage                     6.5
Point of sale                        3.6
Energy                               2.8
Credit card sales authorization      2.6
Telecommunications                   2.0
Call location                        1.6
Manufacturing                        1.6
Financial institutions               1.5
Information technology               1.3
Insurance                            1.2
Retail                               1.1

Source: Meta Group, 2005

Data availability on systems is crucial…

Disruptors of Data Availability…
• Disaster (<1% of occurrences): natural or man-made
  • Flood, fire, earthquake
  • Contaminated building
• Unplanned occurrences (13% of occurrences): failure
  • Database corruption
  • Component failure
  • Human error
• Planned occurrences (87% of occurrences): competing workloads
  • Backup, reporting
  • Data warehouse extracts
  • Application and data restore

To handle business continuity during unavailability, BCP is done…
• BCP stands for Business Continuity Planning.
• It is a plan created to ensure business continuity in adverse situations.
• Broad objectives are:
  1. Identifying the mission or critical business functions
  2. Collecting data on current business processes
  3. Assessing, prioritizing, mitigating, and managing risk
     • Risk Analysis
     • Business Impact Analysis (BIA)
  4. Designing and developing contingency plans and a disaster recovery plan (DR Plan)
  5. Training, testing, and maintenance

Seven ‘R’s of ensuring High Availability…
• Redundancy
• Reputation
• Reliability
• Reparability
• Recoverability
• Responsiveness
• Robustness

Redundancy…
• Redundancy augments the reparability of individual components by establishing a backup or standby.
• This means that there are multiple resources providing the same service.
• The effectiveness of redundancy is a function of how quickly a backup component can be brought into service.

Reputation…
• An organization gets its reputation from the availability standards that it maintains.
• Unavailability would seriously affect the reputation of the business.
• Damage to reputation may have serious cascading effects on the overall setup.

Reliability
• The reliability of hardware and software can be verified from customer references and industry analysts.
• Beyond that, consider performing an Empirical Component Reliability Analysis, which consists of the following steps:
  1. Review and analyze problem management logs.
  2. Review and analyze supplier logs.
  3. Acquire feedback from operations personnel.
  4. Acquire feedback from support personnel.
  5. Acquire feedback from supplier repair personnel.
  6. Compare experiences with other shops.
  7. Study reports from industry analysts.

More…
• An analysis of problem logs should reveal any unusual patterns of failure.
• The logs should be studied by supplier, product, and using department, considering the day and time of failures, frequency of failures, and time to repair.
• Suppliers often keep onsite repair logs that can be perused to conduct a similar analysis.

More…
• Feedback from operations personnel, especially offsite operators, is often candid and can be revealing as to how components truly perform.
• For example, operators may be doing numerous resets on a particular network component every morning prior to startup, but they may not bother to log these activities since the network always comes up.
• Similar conversations with various support personnel, such as systems administrators, network administrators, and database administrators, may elicit similar revelations.

More…
• There may be bias when canvassing a supplier's repair personnel about the true reliability of their products, but these people can be just as candid and revealing as the people using the product.
• This becomes another valuable source of information for evaluating component reliability.
• Yet another is comparing experiences with other shops or setups, particularly shops that are closely aligned with the organization's way of working, use similar platforms and configurations, and offer similar services.
• Even customers can be especially helpful.
• Reports from reputable industry analysts can also be used to predict component reliability.

Reparability
• Reparability is the relative ease with which service technicians can resolve or replace failing components.
• Two common metrics used to evaluate this trait are how long it takes to do the actual repair, and how often the repair work needs to be repeated.
• In more sophisticated systems, initial repair work can be done from remote diagnostic centers where failures are detected, circumvented, and arrangements made for permanent resolution with little or no involvement of operations personnel.
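
The two reparability metrics named above (how long a repair takes, and how often it has to be repeated) can be computed from a repair log. The record layout and sample entries below are hypothetical, intended only as a minimal sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Repair:
    component: str
    hours_to_repair: float
    is_repeat: bool  # True if the same fault had to be repaired again


def mean_time_to_repair(repairs: List[Repair]) -> float:
    """Average duration of a repair, in hours."""
    if not repairs:
        raise ValueError("at least one repair record is required")
    return sum(r.hours_to_repair for r in repairs) / len(repairs)


def repeat_repair_rate(repairs: List[Repair]) -> float:
    """Fraction of repairs that were repeats of an earlier repair."""
    if not repairs:
        raise ValueError("at least one repair record is required")
    return sum(1 for r in repairs if r.is_repeat) / len(repairs)


# Made-up log entries.
log = [
    Repair("router-1", 2.0, False),
    Repair("router-1", 1.0, True),
    Repair("disk-7", 3.0, False),
]
print(mean_time_to_repair(log), repeat_repair_rate(log))
```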

Recoverability
• This refers to the ability to overcome a momentary failure in such a way that there is no impact on end-user availability.
• It could be as small as a portion of main memory recovering from a single-bit memory error, and as large as having an entire server system switch over to its standby system with no loss of data or transactions.
• Recoverability also includes retries of attempted reads and writes out to disk or tape, as well as retrying of transmissions down network lines.
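
Retrying a failed read, write, or transmission is the simplest form of recoverability to show in code. The helper below is a sketch under assumptions: its name, the attempt limit, and the fixed delay are illustrative, and it retries only on `OSError`, the base class Python uses for I/O and network errors.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def with_retries(operation: Callable[[], T], attempts: int = 3, delay_seconds: float = 0.0) -> T:
    """Run operation(); on OSError, retry up to `attempts` times in total."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[OSError] = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError as exc:
            last_error = exc
            if attempt < attempts:
                time.sleep(delay_seconds)
    assert last_error is not None
    raise last_error


# A stand-in for a flaky device: it fails twice, then succeeds.
calls = {"count": 0}


def flaky_read() -> bytes:
    calls["count"] += 1
    if calls["count"] < 3:
        raise OSError("transient read error")
    return b"payload"


data = with_retries(flaky_read, attempts=3)
print(data, calls["count"])
```

If the retries succeed, the caller never sees the momentary failure, which is the sense in which there is no impact on end-user availability.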

Responsiveness…
• Responsiveness is the sense of urgency that all people involved with high availability need to exhibit.
• This includes having well-trained suppliers and in-house support personnel who can respond to problems quickly and efficiently.
• It also pertains to how quickly the automated recovery of resources such as disks or servers can be enacted.

Robustness
• A robust process is able to withstand a variety of forces, both internal and external, that could easily disrupt and undermine availability in a weaker environment.
• Robustness puts a high premium on documentation and training to withstand:
  • technical changes related to platforms, products, services, and customers;
  • personnel changes related to turnover, expansion, and rotation;
  • business changes related to new direction, acquisitions, and mergers.

10 Factors to evaluate System Availability…
1. Executive support: Management supports and sponsors availability with actions such as analyzing outage reports and holding groups accountable.
2. Process owner: The process owner is the person who initiated the process in the system. The owner must ensure timely and accurate analysis and distribution of outage reports.
3. Customer involvement: Processes built on a basic understanding of customer needs have proved more successful than those that are not. Customer involvement in the design and use of the processes is therefore a vital indicator of availability.
4. Supplier involvement: Involvement of key suppliers of hardware and software, and of service providers, is necessary.
5. Service metrics: Analysis of metrics for trends such as the percentage of downtime and the value of time lost due to outages.
6. Process metrics: The extent to which process metrics are analyzed for trends such as the ease and quickness with which servers can be rebooted.
7. Process integration: The degree to which the availability process integrates with other processes and tools, such as overall network management.
8. Streamlining or automation: The extent to which the availability process is streamlined by automating actions such as generating outage tickets and sending notifications.
9. Training of staff: Training of staff on the availability process.
10. Process documentation: The quality and value of availability documentation, measured and maintained for future analysis.
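
The service metrics in factor 5 (percentage of downtime and value of time lost due to outages) can be derived from a list of outage durations. The function names, the outage figures, and the cost per minute below are made up for illustration.

```python
from typing import Sequence


def downtime_percentage(outage_minutes: Sequence[float], period_minutes: float) -> float:
    """Total outage time as a percentage of the reporting period."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    return 100.0 * sum(outage_minutes) / period_minutes


def value_of_time_lost(outage_minutes: Sequence[float], cost_per_minute: float) -> float:
    """Total outage time multiplied by an assumed cost per minute of outage."""
    return sum(outage_minutes) * cost_per_minute


# Made-up data: three outages in one week (7 * 24 * 60 = 10,080 minutes).
week_minutes = 7 * 24 * 60
outages = [30.0, 12.0, 18.0]

print(f"{downtime_percentage(outages, week_minutes):.3f}% downtime")
print(value_of_time_lost(outages, cost_per_minute=50.0))
```

Tracking these two numbers per reporting period is what makes the trend analysis in factor 5 possible.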