Availability & Related Concepts: High Availability, Calculation of Availability, Components of Availability, Characteristics, Rules, Seven ‘R’s. Dr. Neelu J. Ahuja, College of Engineering Studies

Unit 5 Availability Related Concepts

Page 1: Unit 5 Availability Related Concepts

Availability & Related Concepts
• High Availability
• Calculation of Availability
• Components of Availability
• Characteristics
• Rules
• Seven ‘R’s

Dr. Neelu J. Ahuja, College of Engineering Studies

Page 2: Unit 5 Availability Related Concepts

Knowing Availability
• Availability: the percentage of total time that a network, system, or service is available for use.
• What is HA (High Availability)? A network, system, or service with specific design elements intended to keep availability above a high threshold (e.g., 99.99%).

Page 3: Unit 5 Availability Related Concepts

High Availability
• It is a measure of the probability that a service is available for use at any given instant.
• A system is considered "highly available" if it has an uptime of five nines (99.999%).
• Availability is basically a function of:
  • System reliability
  • System reparability
  • System redundancy

Page 4: Unit 5 Availability Related Concepts

RR to Achieve HA
• Rapid Recovery (RR) systems are one way to achieve High Availability.
• An RR system is a network, system, or service with specific design elements intended to recover from downtime very quickly.
• The delay is kept as small as possible so that the user does not face inconvenience.
• The acceptable recovery time may vary depending on the kind of application.

Page 5: Unit 5 Availability Related Concepts

What Is System Reliability?
• It is a measure of continuous system uptime in the absence of IT failure.
• A system is said to be "highly reliable" if it has a high mean time between failures (MTBF).
• MTBF is also called MTBSIF: mean time between service-impacting failures.
• The term is typically taken from the telecom industry.

Page 6: Unit 5 Availability Related Concepts

What Is System Reparability?
• It is a measure of how quickly a failed device or system can be restored to service.
• Reparability is measured as mean time to repair, represented as MTTR.
• The less reliable the system, the greater the need for a low MTTR to support overall system availability.

Page 7: Unit 5 Availability Related Concepts

What Is System Redundancy?
• Redundancy augments the reparability of individual components by establishing a backup or standby.
• This means that there are multiple resources providing the same service.
• The effectiveness of redundancy is a function of how quickly a backup component can be brought into service.

Page 8: Unit 5 Availability Related Concepts

Availability

$$\text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}$$

High availability = high MTBF or low MTTR
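The relationship can be checked numerically. The sketch below is only an illustration: the function name and the MTBF/MTTR figures are hypothetical and do not come from the slides. It shows that raising MTBF or lowering MTTR both push availability up.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), returned as a fraction in [0, 1]."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures, chosen only for illustration.
base = availability(mtbf_hours=1000.0, mttr_hours=1.0)
higher_mtbf = availability(mtbf_hours=2000.0, mttr_hours=1.0)  # more reliable
lower_mttr = availability(mtbf_hours=1000.0, mttr_hours=0.5)   # faster repair

for label, value in [("base", base), ("higher MTBF", higher_mtbf), ("lower MTTR", lower_mttr)]:
    print(f"{label}: {value:.4%}")
```

In this example, doubling MTBF and halving MTTR give the same availability, because only the ratio of the two quantities matters in the formula.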

Page 9: Unit 5 Availability Related Concepts

Calculating Availability
• Availability can be measured directly through periodic polling.
• Polling is a communications control method used by some computer/terminal systems whereby a "master" station asks each of the devices attached to a common transmission medium, in turn, whether it has information to send.
• Common tools used for this include SNMP and Nagios.
• SNMP = Simple Network Management Protocol.
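As a rough sketch of how periodic polling turns into an availability figure: if each poll records whether the service answered, the measured availability is the share of successful polls. The function name and the sample data below are hypothetical; in practice the results would come from a monitoring tool rather than a hard-coded list.

```python
from typing import Iterable


def availability_from_polls(results: Iterable[bool]) -> float:
    """Return measured availability (percent) as the share of successful polls."""
    samples = list(results)
    if not samples:
        raise ValueError("at least one poll result is required")
    return 100.0 * sum(samples) / len(samples)


# Hypothetical day of polling every five minutes: 288 polls, one of which failed.
polls = [True] * 287 + [False]
measured = availability_from_polls(polls)
print(f"measured availability: {measured:.3f}%")
```

The polling interval limits the resolution of the measurement: an outage shorter than the interval between two polls may go unnoticed.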

Page 10: Unit 5 Availability Related Concepts

Using SNMP
• Simple Network Management Protocol (SNMP) is an application-layer protocol that facilitates the exchange of management information between network devices.
• It is part of the Transmission Control Protocol/Internet Protocol (TCP/IP) protocol suite.
• SNMP enables network administrators to manage network performance, find and solve network problems, and plan for network growth.

Page 11: Unit 5 Availability Related Concepts

Nagios
• Nagios is a host and service monitor designed to inform administrators of network problems before clients, end users, or managers encounter them.
• It was initially designed to run under the Linux operating system, but works fine under most *NIX variants as well.
• The monitoring daemon runs intermittent checks on hosts and services, which return status information to Nagios.
• When problems are encountered, the daemon can send notifications to administrative contacts in a variety of ways (email, instant message, SMS, etc.).

Page 12: Unit 5 Availability Related Concepts

Calculating Availability
• These tools are used basically to measure the availability of the system in question.
• A formula for predicting the availability of a single component is:

$$\text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}} \quad \text{or} \quad 1 - \frac{\text{TTR}}{\text{MTBF} + \text{TTR}}$$

where TTR is the time to repair (the same quantity as MTTR).
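The two forms of the formula give the same value. A short derivation, taking the slide's TTR to be the same quantity as MTTR (an assumption, since the slide does not define TTR):

$$1 - \frac{\text{MTTR}}{\text{MTBF} + \text{MTTR}} = \frac{(\text{MTBF} + \text{MTTR}) - \text{MTTR}}{\text{MTBF} + \text{MTTR}} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}$$

The second form reads as "one minus the fraction of time spent under repair", which is sometimes the more convenient way to think about it.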

Page 13: Unit 5 Availability Related Concepts

Components of Availability
• Data centre facility
• Server hardware (processor, memory, communication channels)
• Server system software (OS, program products such as SDKs, WebSphere software, SOA packs)
• Application software (programs, DBMS, etc.)
• Disk hardware (data files, control files)

Page 14: Unit 5 Availability Related Concepts

Contd.
• Database software (data files, data dictionary files)
• Network software (protocols, network component drivers)
• Network hardware (controllers, hubs, routers, repeaters, modems, etc.)
• Desktop software (office suite, general-purpose applications)
• Desktop hardware (processors, memory, disk, interface cards)
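The components listed above all sit on the path between the user and the service. Assuming that they are arranged in series and fail independently (this is an assumption, not something the slides state), the overall availability is the product of the individual availabilities. The sketch below uses hypothetical component figures to show how a chain of individually good components ends up less available than any single one of them.

```python
import math
from typing import Sequence


def series_availability(component_availabilities: Sequence[float]) -> float:
    """Availability of components in series, assuming independent failures."""
    for value in component_availabilities:
        if not 0.0 <= value <= 1.0:
            raise ValueError("each availability must be a fraction between 0 and 1")
    return math.prod(component_availabilities)


# Hypothetical chain: five components, each 99.9% available.
chain = [0.999] * 5
overall = series_availability(chain)
print(f"overall availability: {overall:.4%}")
```

This is why the weakest or least redundant component in the list often deserves the most attention.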

Page 15: Unit 5 Availability Related Concepts

For Good Availability
Characteristic / Priority

High priority
• Knowledge of systems software & components
• Knowledge of network software & components
• Knowledge of database systems
• Knowledge of power & air conditioning systems
• Ability to think & act tactically

Medium priority
• Knowledge of software configuration
• Knowledge of hardware configuration
• Knowledge of backup system
• Knowledge of desktop hardware & software

Low priority
• Knowledge of applications
• Ability to work effectively with developers
• Ability to communicate effectively with IT executives
• Ability to manage diversity
• Ability to think & plan strategically

Page 16: Unit 5 Availability Related Concepts

Rule of Nines Availability

No. of   % of           Weekly       Weekly         Missed events    Missed events
Nines    Availability   Hours Down   Minutes Down   out of 10,000    out of 1,000,000
1        90.000         10.000       600.00         1,000.0          100,000
2        99.000          1.000        60.00           100.0           10,000
3        99.900          0.100         6.00            10.0            1,000
4        99.990          0.010         0.60             1.0              100
5        99.999          0.001         0.06             0.1               10
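The rows of the table follow from one rule: with n nines, the unavailable fraction is 10 to the power of minus n. The weekly figures in the table are consistent with a 100-hour service week (10% of 100 hours is 10 hours); that reading is an inference, not something the slide states. The sketch below, with hypothetical function names, reproduces the table under that assumption and shows how the numbers change for a full 168-hour week.

```python
def availability_percent(nines: int) -> float:
    """Availability in percent for a given number of nines (1 -> 90, 2 -> 99, ...)."""
    return 100.0 * (1.0 - 10.0 ** (-nines))


def downtime_hours(nines: int, period_hours: float) -> float:
    """Expected downtime in hours over a period, for a given number of nines."""
    return period_hours * 10.0 ** (-nines)


# Assumption: the table's "weekly" columns are based on a 100-hour service week.
for n in range(1, 6):
    hours_100 = downtime_hours(n, 100.0)
    hours_168 = downtime_hours(n, 168.0)  # a full 24x7 week, for comparison
    print(f"{n} nines: {availability_percent(n):.3f}% "
          f"| 100-hour week: {hours_100 * 60:.2f} min down "
          f"| 168-hour week: {hours_168 * 60:.2f} min down")
```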

Page 17: Unit 5 Availability Related Concepts

Business Continuity
• During periods of unavailability, the main goal is to keep the business going.
• This is technically referred to as Business Continuity.
• Business Continuity is the preparation for, response to, and recovery from an application outage that adversely affects business operations.
• Business Continuity solutions address systems unavailability, degraded application performance, or unacceptable recovery strategies.

Page 18: Unit 5 Availability Related Concepts

Why Assure Availability?
Know the downtime costs (per hour, day, two days...)

Lost Productivity
• Number of employees impacted (x hours out * hourly rate)

Lost Revenue
• Direct loss
• Compensatory payments
• Lost future revenue
• Billing losses
• Investment losses

Damaged Reputation
• Customers
• Suppliers
• Financial markets
• Banks
• Business partners

Financial Performance
• Revenue recognition
• Cash flow
• Lost discounts (A/P)
• Payment guarantees
• Credit rating
• Stock price

Other Expenses
• Temporary employees, equipment rental, overtime costs, extra shipping costs, travel expenses...
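The lost-productivity term, number of employees impacted (x hours out * hourly rate), is simple arithmetic. A minimal sketch follows; the function name and all figures are hypothetical and serve only to show the calculation.

```python
def lost_productivity(employees_impacted: int, hours_out: float, hourly_rate: float) -> float:
    """Lost productivity = employees impacted * hours out * hourly rate."""
    if employees_impacted < 0 or hours_out < 0 or hourly_rate < 0:
        raise ValueError("inputs must be non-negative")
    return employees_impacted * hours_out * hourly_rate


# Hypothetical outage: 200 employees idle for 4 hours at an hourly rate of 30.
cost = lost_productivity(employees_impacted=200, hours_out=4.0, hourly_rate=30.0)
print(f"lost productivity: {cost:,.2f}")
```

This covers only one of the cost categories; lost revenue, reputation damage, and the other expenses come on top of it and are usually harder to quantify.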

Page 19: Unit 5 Availability Related Concepts

Statistics Showing Loss Due to Unavailability
Example: loss in millions of US dollars

Sector                             Loss (US$ millions)
Retail brokerage                   6.5
Point of sale                      3.6
Energy                             2.8
Credit card sales authorization    2.6
Telecommunications                 2.0
Call location                      1.6
Manufacturing                      1.6
Financial institutions             1.5
Information technology             1.3
Insurance                          1.2
Retail                             1.1

Source: Meta Group, 2005

Page 20: Unit 5 Availability Related Concepts

Data Availability on Systems Is Crucial

Disruptors of Data Availability

Disaster (<1% of occurrences)
• Natural or man-made
  • Flood, fire, earthquake
  • Contaminated building

Unplanned occurrences (13% of occurrences)
• Failure
  • Database corruption
  • Component failure
  • Human error

Planned occurrences (87% of occurrences)
• Competing workloads
  • Backup, reporting
  • Data warehouse extracts
  • Application and data restore

Page 21: Unit 5 Availability Related Concepts

To Handle Business Continuity During Unavailability, BCP Is Done
• BCP stands for Business Continuity Planning.
• It is a plan created to ensure business continuity in adverse situations.
• Broad objectives are:
  1. Identifying the mission or critical business functions
  2. Collecting data on current business processes
  3. Assessing, prioritizing, mitigating, and managing risk
     • Risk analysis
     • Business Impact Analysis (BIA)
  4. Designing and developing contingency plans and a disaster recovery plan (DR plan)
  5. Training, testing, and maintenance

Page 22: Unit 5 Availability Related Concepts

Seven ‘R’s of Ensuring High Availability
• Redundancy
• Reputation
• Reliability
• Reparability
• Recoverability
• Responsiveness
• Robustness

Page 23: Unit 5 Availability Related Concepts

Redundancy
• Redundancy augments the reparability of individual components by establishing a backup or standby.
• This means that there are multiple resources providing the same service.
• The effectiveness of redundancy is a function of how quickly a backup component can be brought into service.

Page 24: Unit 5 Availability Related Concepts

Reputation
• The organization gets its reputation from the availability standards that it maintains.
• Unavailability would seriously affect the reputation of the business.
• The damage to reputation may have serious cascading effects on the overall setup.

Page 25: Unit 5 Availability Related Concepts

Reliability
• The reliability of hardware and software can be verified from customer references and industry analysts.
• Beyond that, consider performing an Empirical Component Reliability Analysis, which consists of the following steps:
  1. Review and analyze problem management logs.
  2. Review and analyze supplier logs.
  3. Acquire feedback from operations personnel.
  4. Acquire feedback from support personnel.
  5. Acquire feedback from supplier repair personnel.
  6. Compare experiences with other shops.
  7. Study reports from industry analysts.

Page 26: Unit 5 Availability Related Concepts

More
• An analysis of problem logs should reveal any unusual patterns of failure.
• The logs should be studied by supplier, product, and using department, considering the day and time of failures, the frequency of failures, and the time to repair.
• Suppliers often keep onsite repair logs that can be perused to conduct a similar analysis.

Page 27: Unit 5 Availability Related Concepts

More
• Feedback from operations personnel, especially offsite operators, is often candid and can be revealing as to how components truly perform.
• For example, operators may be doing numerous resets on a particular network component every morning prior to startup, but they may not bother to log these activities since the network always comes up.
• Similar conversations with various support personnel, such as systems administrators, network administrators, and database administrators, may elicit similar revelations.

Page 28: Unit 5 Availability Related Concepts

More
• There may be bias when canvassing a supplier's repair personnel about the true reliability of their products.
• But these people can be just as candid and revealing as the people using the product.
• This becomes another valuable source of information for evaluating component reliability.
• Yet another source is comparing experiences with other shops or setups.
• Look for shops that are closely aligned with the organization's way of working, that use similar platforms and configurations, and that offer similar services.
• Even customers can be especially helpful.
• Reports from reputable industry analysts can also be used to predict component reliability.

Page 29: Unit 5 Availability Related Concepts

Reparability
• Reparability is the relative ease with which service technicians can resolve or replace failing components.
• Two common metrics used to evaluate this trait are how long it takes to do the actual repair and how often the repair work needs to be repeated.
• In more sophisticated systems, initial repair work can be done from remote diagnostic centers, where failures are detected and circumvented and arrangements are made for permanent resolution with little or no involvement of operations personnel.

Page 30: Unit 5 Availability Related Concepts

Recoverability
• This refers to the ability to overcome a momentary failure in such a way that there is no impact on end-user availability.
• It could be as small as a portion of main memory recovering from a single-bit memory error, and as large as having an entire server system switch over to its standby system with no loss of data or transactions.
• Recoverability also includes retries of attempted reads and writes out to disk or tape, as well as retrying of transmissions down network lines.
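The retry idea can be illustrated in a few lines. This is a minimal sketch under stated assumptions: the helper `with_retries` and the simulated `flaky_read` operation are hypothetical names invented for the example, and transient failures are assumed to surface as `OSError`. It shows a momentary failure being absorbed so that the caller never sees it.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def with_retries(operation: Callable[[], T], attempts: int = 3, delay_seconds: float = 0.0) -> T:
    """Run operation, retrying on OSError up to the given number of attempts."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[OSError] = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError as exc:
            last_error = exc
            if attempt < attempts:
                time.sleep(delay_seconds)  # brief pause before the next try
    assert last_error is not None
    raise last_error  # every attempt failed: surface the last error


# Simulated read that fails twice with a transient error, then succeeds.
calls = {"count": 0}


def flaky_read() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise OSError("transient I/O error")
    return "data"


result = with_retries(flaky_read, attempts=3)
print(result, "after", calls["count"], "attempts")
```

Bounding the number of attempts matters: a failure that is not momentary should be reported rather than retried forever.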

Page 31: Unit 5 Availability Related Concepts

Responsiveness
• Responsiveness is the sense of urgency that all people involved with high availability need to exhibit.
• This includes having well-trained suppliers and in-house support personnel who can respond to problems quickly and efficiently.
• It also pertains to how quickly the automated recovery of resources such as disks or servers can be enacted.

Page 32: Unit 5 Availability Related Concepts

Robustness
• A robust process is able to withstand a variety of forces, both internal and external, that could easily disrupt and undermine availability in a weaker environment.
• Robustness puts a high premium on documentation and training to withstand technical changes related to platforms, products, services, and customers; personnel changes related to turnover, expansion, and rotation; and business changes related to new direction, acquisitions, and mergers.

Page 33: Unit 5 Availability Related Concepts

10 Factors to Evaluate System Availability
1. Executive support: Management supports and sponsors availability with actions such as analyzing outage reports and holding groups accountable.
2. Process owner: The process owner is the person who has initiated the process in the system. It is necessary for the owner to ensure timely and accurate analysis and distribution of outage reports.

Page 34: Unit 5 Availability Related Concepts

Contd.
3. Customer involvement: Processes that build on a basic understanding of customer needs have proved more successful than those that do not. Customer involvement in the design and use of the processes plays a vital role as an indicator of availability.

Page 35: Unit 5 Availability Related Concepts

Contd.
4. Supplier involvement: Involvement of key suppliers of hardware and software, and of service providers, is necessary.
5. Service metrics: Analysis of metrics for trends such as percentage of downtime and value of time lost due to outages.

Page 36: Unit 5 Availability Related Concepts

Contd.
6. Process metrics: The extent to which process metrics are analyzed for trends such as the ease and quickness with which servers can be rebooted.
7. Process integration: The degree to which the availability process integrates with other processes and tools, such as overall network management.

Page 37: Unit 5 Availability Related Concepts

Contd.
8. Streamlining or automation: The extent to which the availability process is streamlined by automating actions such as the generation, notification, or issuing of outage tickets.
9. Training of staff: Training of staff on the availability process.

Page 38: Unit 5 Availability Related Concepts

Contd.
10. Process documentation: Quality and value of availability documentation, measured and maintained for future analysis.