Disaster Recovery - On-Premise & Cloud

We will cover different scenarios for Disaster Recovery

CLOUDCONF 2014

Database: backup and disaster recovery in Cloud

Walter Dal Mut

@walterdalmut – www.corley.it – walterdalmut.com

DISASTER RECOVERY

Disaster recovery (DR) is about preparing for and recovering from a disaster.

DISASTER

Any event that has a negative impact on your business continuity or finances could be termed a disaster.

WHY ARE WE TALKING ABOUT DR?

• Over 70% of businesses involved in a major fire either do not reopen, or subsequently fail within 3 years of the fire. (Source: continuitycentral.com)

• 80% of businesses affected by a major incident either never re-open or close within 18 months. (Source: Axa)

• 70% of companies go out of business after a major data loss. (Source: continuitycentral.com)

• 80% of businesses suffering a computer disaster, who have no disaster recovery plans, go out of business. (Source: “A Bridge Too Far”, IBM Business Recovery Service & Cranfield, 1993)

• A recent study from Gartner, Inc., found that 90% of companies that experience data loss go out of business within two years.

• 80% of companies without well-conceived data protection and recovery strategies go out of business within 2 years of a major disaster. (Source: US National Archives and Records Administration)

RTO – RECOVERY TIME OBJECTIVE

This is the duration of time and the service level to which a business process must be restored after a disaster

RTO – WHAT DOES IT IMPLY?

• Have a system that records 1000 transactions per hour

• Take a snapshot of the system at 03:00 am (every day)

• At 10:00 am a disaster event occurs

• You spend 1 hour sorting things out for the backup (off-site, preparation, etc.)

• The recovery operation takes 4 hours to get back into operation (at minimum service level)

• 1 + 4 = 5 hours is the RECOVERY TIME OBJECTIVE
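The timeline above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name `recovery_timeline` and the calendar date are assumptions, while the 10:00 disaster, the 1 hour of preparation and the 4 hours of restore come from the slide.

```python
from datetime import datetime, timedelta


def recovery_timeline(disaster_at, preparation, restore):
    """Return (time service is restored, total downtime) for a backup-and-restore recovery."""
    downtime = preparation + restore
    return disaster_at + downtime, downtime


# The date is arbitrary; only the 10:00 am time of day comes from the slide.
disaster = datetime(2014, 1, 1, 10, 0)
restored_at, downtime = recovery_timeline(
    disaster,
    preparation=timedelta(hours=1),  # sorting out the backup
    restore=timedelta(hours=4),      # recover operation
)

print("Service restored at:", restored_at.strftime("%H:%M"))
print("Downtime:", downtime)
```

If the business requires a shorter RTO than the downtime computed here, either the preparation step or the restore step has to get faster.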

RPO – RECOVERY POINT OBJECTIVE

This describes the acceptable amount of data loss measured in time.

RPO – WHAT DOES IT IMPLY?

• Have a system that records 1000 transactions per hour

• Take a snapshot of the system at 03:00 am (every day)

• At 10:00 am a disaster event occurs

• In this case we lost around 7000 transactions: 1000 transactions between 03:00 and 04:00, 1000 transactions between 04:00 and 05:00, …

• But with a daily snapshot we are accepting up to 24 hours of data loss: 24000 transactions (RPO)
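The same numbers as a small calculation. This is a sketch that assumes a constant transaction rate; the helper name `transactions_lost` is made up for the example.

```python
def transactions_lost(rate_per_hour, hours_since_snapshot):
    """Transactions recorded after the last snapshot, assuming a constant rate."""
    return rate_per_hour * hours_since_snapshot


RATE = 1000  # transactions per hour, from the slide

# Snapshot at 03:00, disaster at 10:00: 7 hours of data are gone.
actual_loss = transactions_lost(RATE, 10 - 3)

# Worst case with one snapshot per day: the disaster hits just before the next snapshot.
worst_case_loss = transactions_lost(RATE, 24)

print(actual_loss, worst_case_loss)
```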

DISASTER RECOVERY STRATEGIES

Local tape backup

Online backup

Pilot-Light

Warm Stand-by

And More…

[Scale across the strategies above: cost from $ to $$$$$$, recovery time from days to seconds]

ON-PREMISE & CLOUD

Use cloud resources in order to provide business continuity

Disaster Recovery & Cloud?

• On Demand: we can allocate and release new resources whenever we need

• Cost Effective: pay-as-you-go model. We pay only for the resources that we are actually using

• Scalable: we can scale freely and adapt our strategy thanks to autoscaling and other mechanisms

• Secure: control doesn’t mean security

FOCUS ON DATABASES

We will focus on MySQL, but the same approach applies to your own infrastructure without any problem.

BACKUP & RESTORE

Take a snapshot of a system and restore it when you need it.

[Diagram: Application, Backup, Restore]

RTO & RPO?

Things to remember…

RTO

What resources can impact my RTO?

RESOURCES ALLOCATION

How fast can we set up all resources, e.g. instances, network, etc.?

DB RESTORE

How long can the database restore take?

RPO

What resources can impact my RPO?

DB SNAPSHOT

How much time do we need to recover all data from our snapshot?

Backup & Restore – RPO & RTO

Configuration

• Resources Allocation: ???

• Restore Operation: ???

• DNS: TTL 30 minutes

• Snapshot: every 24 hours

Effects

• RTO – Recovery Time Objective: 30 minutes + ??? + ???

• RPO – Recovery Point Objective: 24 hours

• Downtime per month: 86.23 minutes at 99.8% availability, 21.56 minutes at 99.95% availability
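The monthly downtime budget follows directly from the availability percentage. A minimal sketch, assuming a 30-day month (the function name `downtime_minutes` is hypothetical); the exact values depend on the month length assumed, so they can differ slightly from the figures on the slide.

```python
def downtime_minutes(availability_percent, days_in_month=30):
    """Minutes of downtime allowed per month at a given availability level."""
    minutes_in_month = days_in_month * 24 * 60
    return minutes_in_month * (1 - availability_percent / 100)


for level in (99.8, 99.95):
    print(f"{level}% availability -> {downtime_minutes(level):.2f} minutes per month")
```

Any RTO larger than this budget means a single disaster already breaks the availability target for that month.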

COSTS ON S3 (AWS)

• $0.085 / GB, durability 99.999999999%

• $0.068 / GB, durability 99.99%

• $0.010 / GB, durability 99.999999999% [Glacier]
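To get a feeling for what these prices mean for a backup, here is a rough estimate. The tier labels, the 100 GB backup size and the reading of the prices as per GB per month are assumptions made for the example; only the three prices come from the slide, and they are 2014 list prices.

```python
# Prices from the slide, assumed to be per GB per month.
# The tier labels are illustrative names, not official product names.
PRICE_PER_GB = {
    "standard": 0.085,            # durability 99.999999999%
    "reduced_durability": 0.068,  # durability 99.99%
    "glacier": 0.010,             # durability 99.999999999%
}


def storage_cost(size_gb, tier):
    """Storage cost in dollars for a backup of size_gb in the given tier."""
    return size_gb * PRICE_PER_GB[tier]


BACKUP_SIZE_GB = 100  # hypothetical database dump size
for tier in PRICE_PER_GB:
    print(f"{tier}: ${storage_cost(BACKUP_SIZE_GB, tier):.2f}")
```

Request and retrieval fees are left out of this sketch.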

PILOT LIGHT

We can keep a small resource always active that helps us bring up the whole system.

ON-PREMISE – WEB APP

READ REPLICA ON A CLOUD PROVIDER

MOVE TO CLOUD ON A DISASTER

RTO & RPO?

Things to remember…

RTO

What resources can impact my RTO?

RESOURCES ALLOCATION

Running and configuring new instances typically takes a couple of minutes.

You always have to keep an eye on resources and timing.

DNS PROPAGATION

DNS takes a little while to propagate new addresses (Time To Live).

RPO

What resources can impact my RPO?

DB REPLICATION

Remember that Master/Slave replication is ASYNC!

That implies replication LAG, which impacts your RPO!

MONITOR YOUR INFRASTRUCTURE

Setting an RPO of about 20 minutes implies that your replication LAG must always stay under 20 minutes!
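A monitoring check for this rule could look like the sketch below. It assumes the lag value is read from the `Seconds_Behind_Master` column of MySQL's `SHOW SLAVE STATUS` output, which is NULL when replication is not running; how the value is fetched (client library, monitoring agent) is left out, and the function name `lag_within_rpo` is hypothetical.

```python
from typing import Optional


def lag_within_rpo(seconds_behind_master: Optional[int], rpo_minutes: float) -> bool:
    """Return True when the replica lag is safely under the RPO.

    seconds_behind_master is assumed to come from SHOW SLAVE STATUS;
    None (SQL NULL) means replication is not running, so the RPO cannot be guaranteed.
    """
    if seconds_behind_master is None:
        return False
    return seconds_behind_master < rpo_minutes * 60


RPO_MINUTES = 20  # from the slide

print(lag_within_rpo(300, RPO_MINUTES))   # 5 minutes behind
print(lag_within_rpo(1500, RPO_MINUTES))  # 25 minutes behind
print(lag_within_rpo(None, RPO_MINUTES))  # replication stopped
```

In practice the check would run on a schedule and raise an alert as soon as it returns False.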

Pilot Light – RPO & RTO

Configuration

• Resources Allocation: 20 minutes

• DNS: TTL 30 minutes

• Replication LAG: 20 minutes

Effects

• RTO – Recovery Time Objective: 50 minutes

• RPO – Recovery Point Objective: 20 minutes

• Downtime per month: 86.23 minutes at 99.8% availability, 21.56 minutes at 99.95% availability

COSTS ON AWS

• 1 m1.small at $0.06 per hour: ~$43 per month

• $0.05 per GB EBS

• $0.05 per 1 million I/O requests EBS

WARM STANDBY

Extends pilot-light resource allocation and preparation.

Warm Standby – RPO & RTO

Configuration

• Resources Allocation: 5 minutes

• DNS: TTL 30 minutes

• Replication LAG: 20 minutes

Effects

• RTO – Recovery Time Objective: 35 minutes

• RPO – Recovery Point Objective: 20 minutes

• Downtime per month: 86.23 minutes at 99.8% availability, 21.56 minutes at 99.95% availability
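In both configurations the RTO is the sum of the resource allocation time and the DNS TTL, while the RPO equals the replication lag. A small sketch of that model, with the pilot light and warm standby figures taken from the slides (the function name `rto_minutes` is an assumption, and the model ignores any overlap between the two steps):

```python
def rto_minutes(resource_allocation, dns_ttl):
    """Worst-case RTO: allocate resources, then wait for DNS caches to expire."""
    return resource_allocation + dns_ttl


DNS_TTL = 30  # minutes, same in both configurations

pilot_light_rto = rto_minutes(resource_allocation=20, dns_ttl=DNS_TTL)
warm_standby_rto = rto_minutes(resource_allocation=5, dns_ttl=DNS_TTL)

print("Pilot light RTO:", pilot_light_rto, "minutes")
print("Warm standby RTO:", warm_standby_rto, "minutes")
```

With these numbers the DNS TTL dominates the RTO, so lowering the TTL would help both strategies more than faster resource allocation.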

COSTS ON AWS

• 2 m1.small at $0.06 per hour each: ~$86 per month

• $0.05 per GB EBS

• $0.05 per 1 million I/O requests EBS

• ELB: $20 per month
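The monthly figures can be reproduced with a quick calculation. This sketch assumes a 30-day month and leaves out EBS storage and I/O, which depend on the database size and load; the hourly price and the ELB cost are the 2014 figures from the slides, and the names are made up for the example.

```python
HOURS_PER_MONTH = 24 * 30  # assuming a 30-day month
ELB_PER_MONTH = 20.0       # dollars, from the slide


def monthly_instance_cost(count, price_per_hour=0.06):
    """Monthly cost in dollars of `count` always-on m1.small instances."""
    return count * price_per_hour * HOURS_PER_MONTH


pilot_light_cost = monthly_instance_cost(1)
warm_standby_cost = monthly_instance_cost(2) + ELB_PER_MONTH

print(f"Pilot light:  ${pilot_light_cost:.2f} per month")
print(f"Warm standby: ${warm_standby_cost:.2f} per month")
```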

PILOT LIGHT VS WARM STAND-BY

In our examples, Pilot Light actually looks much more effective than Warm Stand-by.

Isn’t it?

DEPENDS ON ASSUMPTIONS

We assume that we don’t need to scale out our database, and that scaling it up is enough!

Resource allocation for new read replicas? How long does it take?

THANKS FOR LISTENING
