CLOUDCONF 2014 – Database: backup and disaster recovery in the Cloud – Walter Dal Mut @walterdalmut – www.corley.it – walterdalmut.com

Disaster Recovery - On-Premise & Cloud


DESCRIPTION

We will cover different scenarios for Disaster Recovery


Page 1

CLOUDCONF 2014

Database: backup and disaster recovery in the Cloud

Walter Dal Mut

@walterdalmut – www.corley.it – walterdalmut.com

Page 2

DISASTER RECOVERY

Disaster recovery (DR) is about preparing for, and recovering from, a disaster.

Page 3

DISASTER

Any event that has a negative impact on your business continuity or finances could be termed a disaster.

Page 4

WHY ARE WE TALKING ABOUT DR?

• Over 70% of businesses involved in a major fire either do not reopen, or subsequently fail within 3 years of the fire. (Source: continuitycentral.com)

• 80% of businesses affected by a major incident either never reopen or close within 18 months. (Source: Axa)

• 70% of companies go out of business after a major data loss. (Source: continuitycentral.com)

• 80% of businesses suffering a computer disaster, who have no disaster recovery plans, go out of business. (Source: “A Bridge Too Far”, IBM Business Recovery Service & Cranfield, 1993)

• A study from Gartner, Inc. found that 90% of companies that experience data loss go out of business within two years.

• 80% of companies without well-conceived data protection and recovery strategies go out of business within 2 years of a major disaster. (Source: US National Archives and Records Administration)

Page 5

RTO – RECOVERY TIME OBJECTIVE

This is the duration of time and the service level within which a business process must be restored after a disaster.

Page 6

RTO – WHAT DOES IT IMPLY?

• You have a system that records 1000 transactions per hour

• You take a snapshot of the system at 03:00 am (every day)

• At 10:00 am a disaster occurs

• You spend 1 hour getting the backup ready (retrieving it from off-site storage, preparation, etc.)

• The restore operation takes 4 hours to get back into operation (at the minimum service level)

• 5 hours (1 + 4) is the RECOVERY TIME OBJECTIVE
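The timeline above is simple arithmetic; a minimal Python sketch makes it explicit. The phase durations come from the example, while the calendar date is an arbitrary placeholder.

```python
from datetime import datetime, timedelta

# Phase durations from the example; the date itself is an arbitrary placeholder.
disaster_at = datetime(2014, 1, 1, 10, 0)  # disaster at 10:00 am
preparation = timedelta(hours=1)           # fetch the off-site backup, prepare
restore = timedelta(hours=4)               # restore up to the minimum service level

recovery_time = preparation + restore
service_back_at = disaster_at + recovery_time

print(f"Recovery time: {recovery_time.total_seconds() / 3600:.0f} hours")  # 5 hours
print(f"Service back at: {service_back_at:%H:%M}")                          # 15:00
```

Keeping each phase as a separate value makes it easy to see which one dominates the RTO and therefore where to invest first.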

Page 7

RPO – RECOVERY POINT OBJECTIVE

This describes the acceptable amount of data loss measured in time.

Page 8

RPO – WHAT DOES IT IMPLY?

• You have a system that records 1000 transactions per hour

• You take a snapshot of the system at 03:00 am (every day)

• At 10:00 am a disaster occurs

• In this case we lost around 7000 transactions:
  • 1000 transactions from 03:00 to 04:00
  • 1000 transactions from 04:00 to 05:00
  • …

• But with a daily snapshot we are accepting up to 24 hours of data loss, i.e. 24000 transactions (RPO)
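A small Python sketch of the same calculation, assuming a constant rate of 1000 transactions per hour as in the example: the actual loss depends on when the disaster hits, while the RPO is the worst case allowed by the snapshot interval.

```python
from datetime import datetime, timedelta

TRANSACTIONS_PER_HOUR = 1000
snapshot_interval = timedelta(hours=24)  # one snapshot per day

last_snapshot = datetime(2014, 1, 1, 3, 0)  # 03:00 am snapshot (placeholder date)
disaster_at = datetime(2014, 1, 1, 10, 0)   # 10:00 am disaster

def transactions_in(window: timedelta) -> int:
    """Transactions recorded in a time window at a constant hourly rate."""
    return round(window / timedelta(hours=1) * TRANSACTIONS_PER_HOUR)

lost = transactions_in(disaster_at - last_snapshot)  # written after the last snapshot
worst_case = transactions_in(snapshot_interval)      # accepted by a 24-hour RPO

print(lost)        # 7000
print(worst_case)  # 24000
```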

Page 9

DISASTER RECOVERY STRATEGIES

• Local tape backup
• Online backup
• Pilot-Light
• Warm Stand-by
• And more…

(Figure: the strategies above plotted by cost, from $ to $$$$$$, against recovery time, from days to seconds.)

Page 10

ON-PREMISE & CLOUD

Use cloud resources in order to provide business continuity

Page 11

Disaster Recovery & Cloud?

• On demand: we can allocate and release new resources whenever we need them

• Cost effective: pay-as-you-go model; we pay only for the resources we actually use

• Scalable: we can scale freely and adapt our strategy thanks to autoscaling and other mechanisms

• Secure: control doesn't mean security

Page 12

FOCUS ON DATABASES

We will focus on MySQL, but the same ideas apply to the other databases in your infrastructure.

Page 13

BACKUP & RESTORE

Take a snapshot of a system and restore it when you need it.
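The slides do not name a backup tool. As one possible sketch, assuming `mysqldump` is used for the MySQL snapshot, Python can build the backup and restore commands; the database name `shop` and the timestamped file naming are hypothetical, and copying the dump off-site (for example to S3) is left out.

```python
from __future__ import annotations

import shlex
from datetime import datetime, timezone

def backup_command(database: str, taken_at: datetime) -> list[str]:
    """Build a mysqldump invocation that writes a timestamped dump file."""
    dump_file = f"{database}-{taken_at:%Y%m%d-%H%M%S}.sql"  # hypothetical naming scheme
    return [
        "mysqldump",
        "--single-transaction",        # consistent read of InnoDB tables during the dump
        f"--result-file={dump_file}",  # write the dump to this file
        database,
    ]

def restore_command(database: str, dump_file: str) -> str:
    """Shell command that replays a dump into an existing database."""
    return f"mysql {shlex.quote(database)} < {shlex.quote(dump_file)}"

cmd = backup_command("shop", datetime(2014, 4, 1, 3, 0, tzinfo=timezone.utc))
print(shlex.join(cmd))
print(restore_command("shop", "shop-20140401-030000.sql"))
```

Building the command as a list keeps it ready for `subprocess.run` without going through a shell; the restore needs input redirection, so it is kept as a quoted shell string instead.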

Page 14

Application

Page 15

Backup

Page 16

Restore

Page 17

RTO & RPO?

Things to remember…

Page 18

RTO

What resources can impact my RTO?

Page 19

RESOURCES ALLOCATION

How fast can we set up all resources, e.g. instances, network, etc.?

Page 20

DB RESTORE

How much time can the database restore take?

Page 21

RPO

What resources can impact my RPO?

Page 22

DB SNAPSHOT

How old is the data we can recover from our snapshot?

Page 23

Backup & Restore – RPO & RTO

Configuration

• Resources allocation: ???

• Restore operation: ???

• DNS: TTL 30 minutes

• Snapshot: every 24 hours

Effects

• RTO – Recovery Time Objective: 30 minutes + ??? + ???

• RPO – Recovery Point Objective: 24 hours

• Downtime per month:
  • 99.8% availability: 86.23 minutes
  • 99.95% availability: 21.56 minutes
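The monthly downtime budget follows directly from the availability target. This Python sketch assumes a 30-day month, which gives 86.4 and 21.6 minutes, close to the figures above (the exact values depend on the month length assumed). It also checks that the DNS TTL alone, a lower bound on this RTO, already exceeds the 99.95% budget.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month

def monthly_downtime_budget(availability_percent: float) -> float:
    """Minutes of downtime per month allowed by an availability target."""
    return MINUTES_PER_MONTH * (1 - availability_percent / 100)

for target in (99.8, 99.95):
    print(f"{target}% -> {monthly_downtime_budget(target):.2f} minutes")

# The 30-minute DNS TTL is a lower bound: the ??? terms can only add to it.
rto_lower_bound = 30
print(rto_lower_bound > monthly_downtime_budget(99.95))  # True: one disaster breaks 99.95%
```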

Page 24

COSTS ON S3 (AWS)

• $0.085 per GB per month: durability 99.999999999%

• $0.068 per GB per month: durability 99.99%

• $0.010 per GB per month: durability 99.999999999% [Glacier]
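To turn the per-GB prices into a monthly figure, here is a Python sketch that assumes a hypothetical 10 GB dump, taken daily and kept for 30 days, and uses the 2014 prices quoted above. It ignores request and transfer charges; keep in mind that Glacier retrievals are not immediate, which adds to the RTO.

```python
# Per-GB monthly prices quoted on the slide (2014 figures, not current pricing).
PRICE_PER_GB = {
    "standard": 0.085,
    "99.99% durability": 0.068,
    "glacier": 0.010,
}

def monthly_cost(snapshot_gb: float, retained: int, tier: str) -> float:
    """Storage cost of keeping `retained` full snapshots of `snapshot_gb` each."""
    return snapshot_gb * retained * PRICE_PER_GB[tier]

# Hypothetical workload: a 10 GB dump, one per day, 30 days retained (300 GB).
for tier in PRICE_PER_GB:
    print(f"{tier}: ${monthly_cost(10, 30, tier):.2f} per month")
```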

Page 25

PILOT LIGHT

We keep a small resource always running that helps us bring up the whole system when we need it.

Page 27

ON-PREMISE – WEB APP

Page 28

READ REPLICA ON A CLOUD PROVIDER

Page 29

MOVE TO CLOUD ON A DISASTER

Page 30

RTO & RPO?

Things to remember…

Page 31

RTO

What resources can impact my RTO?

Page 32

RESOURCES ALLOCATION

Running and configuring new instances typically takes a couple of minutes.

You always have to account for resources and their allocation times.

Page 33

DNS PROPAGATION

DNS takes a while to propagate new addresses: clients keep using cached records until their TTL (Time To Live) expires.

Page 34

RPO

What resources can impact my RPO?

Page 35

DB REPLICATION

Remember that Master/Slave replication is ASYNC by default!

This implies replication LAG, and the lag impacts your RPO!

Page 36

MONITOR YOUR INFRASTRUCTURE

Setting an RPO of about 20 minutes implies that your replication LAG must always stay under 20 minutes!
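A monitoring check for this rule can be sketched in Python. It assumes the caller has already fetched one row of MySQL's `SHOW SLAVE STATUS` as a dictionary (for example through a driver's dictionary cursor, not shown; newer MySQL versions call it `SHOW REPLICA STATUS` with `Seconds_Behind_Source`). `Seconds_Behind_Master` is NULL when replication is not running, which the sketch treats as a violation. It is a convenient but imperfect signal, since it only reflects events the replica has already received.

```python
from typing import Any, Mapping, Optional

RPO_SECONDS = 20 * 60  # RPO of 20 minutes, as on the slide

def lag_within_rpo(slave_status: Mapping[str, Any], rpo_seconds: int = RPO_SECONDS) -> bool:
    """Check one row of SHOW SLAVE STATUS against the RPO.

    Seconds_Behind_Master is NULL (None) when replication is not running,
    so the RPO cannot be guaranteed and the check fails.
    """
    lag: Optional[int] = slave_status.get("Seconds_Behind_Master")
    if lag is None:
        return False
    return lag < rpo_seconds

print(lag_within_rpo({"Seconds_Behind_Master": 45}))    # True
print(lag_within_rpo({"Seconds_Behind_Master": 1800}))  # False: 30 minutes behind
print(lag_within_rpo({"Seconds_Behind_Master": None}))  # False: replication stopped
```

An alert should fire well before the lag reaches the RPO, so in practice the threshold passed to the check would be lower than the RPO itself.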

Page 37

Pilot Light – RPO & RTO

Configuration

• Resources allocation: 20 minutes

• DNS: TTL 30 minutes

• Replication LAG: 20 minutes

Effects

• RTO – Recovery Time Objective: 50 minutes

• RPO – Recovery Point Objective: 20 minutes

• Downtime per month:
  • 99.8% availability: 86.23 minutes
  • 99.95% availability: 21.56 minutes
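The effects above follow from the configuration: RTO is resource allocation plus DNS TTL (20 + 30 = 50 minutes) and RPO is the replication lag. A minimal Python sketch of that model, assuming the two RTO steps happen one after the other and ignoring other steps such as promoting the replica to master:

```python
from dataclasses import dataclass

@dataclass
class DrSetup:
    name: str
    resource_allocation_min: int  # time to bring up the missing resources
    dns_ttl_min: int              # clients may cache the old address this long
    replication_lag_min: int      # worst accepted lag of the cloud replica

    @property
    def rto_min(self) -> int:
        # Allocation and DNS propagation are summed, as in the configuration above.
        return self.resource_allocation_min + self.dns_ttl_min

    @property
    def rpo_min(self) -> int:
        # Anything not yet replicated when the disaster hits is lost.
        return self.replication_lag_min

pilot_light = DrSetup("Pilot Light", resource_allocation_min=20, dns_ttl_min=30, replication_lag_min=20)
print(pilot_light.rto_min, pilot_light.rpo_min)  # 50 20
```

Lowering the DNS TTL shortens the RTO as much as faster allocation does, which is easy to check by changing one field at a time.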

Page 38

COSTS ON AWS

• $0.06 per hour for 1 m1.small: ~$43 per month

• $0.05 per GB (EBS)

• $0.05 per 1 million I/O requests (EBS)

Page 39

WARM STANDBY

Extends pilot-light resource allocation and preparation

Page 40

Warm Standby

Page 41

Warm Stand-by

Page 42

Warm Stand-by – RPO & RTO

Configuration

• Resources allocation: 5 minutes

• DNS: TTL 30 minutes

• Replication LAG: 20 minutes

Effects

• RTO – Recovery Time Objective: 35 minutes

• RPO – Recovery Point Objective: 20 minutes

• Downtime per month:
  • 99.8% availability: 86.23 minutes
  • 99.95% availability: 21.56 minutes

Page 43

COSTS ON AWS

• $0.06 per hour for 2 m1.small: ~$86 per month

• $0.05 per GB (EBS)

• $0.05 per 1 million I/O requests (EBS)

• ELB: $20 per month
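Using the 2014 prices quoted for Pilot Light and Warm Stand-by, a Python sketch can put both monthly bills side by side. It assumes a 30-day month (0.06 × 24 × 30 = 43.2, the "~$43" above) and a hypothetical 50 GB of EBS with 100 million I/O requests per month for both setups.

```python
HOURS_PER_MONTH = 24 * 30    # assumption: a 30-day month
M1_SMALL_PER_HOUR = 0.06     # slide figures (2014 prices)
EBS_PER_GB_MONTH = 0.05
EBS_PER_MILLION_IO = 0.05
ELB_PER_MONTH = 20.0

def aws_monthly_cost(instances: int, ebs_gb: float, million_io: float, elb: bool) -> float:
    """Compute + EBS storage and I/O + optional load balancer, per month."""
    compute = instances * M1_SMALL_PER_HOUR * HOURS_PER_MONTH
    storage = ebs_gb * EBS_PER_GB_MONTH + million_io * EBS_PER_MILLION_IO
    return compute + storage + (ELB_PER_MONTH if elb else 0.0)

# Hypothetical volume: 50 GB of EBS and 100 million I/O requests per month.
pilot_light = aws_monthly_cost(instances=1, ebs_gb=50, million_io=100, elb=False)
warm_standby = aws_monthly_cost(instances=2, ebs_gb=50, million_io=100, elb=True)
print(f"Pilot Light:   ${pilot_light:.2f}")   # $50.70
print(f"Warm Stand-by: ${warm_standby:.2f}")  # $113.90
```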

Page 44

PILOT LIGHT VS WARM STAND-BY

In our examples, Pilot Light looks much more effective than Warm Stand-by.

Doesn't it?

Page 45

IT DEPENDS ON THE ASSUMPTIONS

We assumed that we don't need to scale out our database, and that scaling it up is enough!

What about resource allocation for new read replicas? How long does it take?

Page 46

THANKS FOR LISTENING