Disaster Recovery - On-Premise & Cloud

DESCRIPTION

We will cover different scenarios for disaster recovery.

Text of Disaster Recovery - On-Premise & Cloud

  • 1. CLOUDCONF 2014
    Database: backup and disaster recovery in the Cloud
    Walter Dal Mut, @walterdalmut, www.corley.it, walterdalmut.com

  • 2. DISASTER RECOVERY
    Disaster recovery (DR) is about preparing for and recovering from a disaster.

  • 3. DISASTER
    Any event that has a negative impact on your business continuity or finances could be termed a disaster.

  • 4. WHY ARE WE TALKING ABOUT DR?
    Over 70% of businesses involved in a major fire either do not reopen, or subsequently fail within 3 years of the fire. (Source: continuitycentral.com)
    80% of businesses affected by a major incident either never re-open or close within 18 months. (Source: Axa)
    70 percent of companies go out of business after a major data loss. (Source: continuitycentral.com)
    80% of businesses suffering a computer disaster, who have no disaster recovery plans, go out of business. (Source: A Bridge Too Far, IBM Business Recovery Service & Cranfield, 1993)
    A recent study from Gartner, Inc., found that 90 percent of companies that experience data loss go out of business within two years.
    80 percent of companies without well-conceived data protection and recovery strategies go out of business within 2 years of a major disaster. (Source: US National Archives and Records Administration)

  • 5. RTO: RECOVERY TIME OBJECTIVE
    The length of time within which, and the service level to which, a business process must be restored after a disaster.

  • 6. RTO: WHAT DOES IT IMPLY?
    Take a system that records 1000 transactions per hour.
    A snapshot of the system is taken at 03:00 am (every day).
    At 10:00 am a disaster event occurs.
    You spend 1 hour sorting things out for the backup (off-site, preparation, etc.).
    The recovery operation takes 4 hours to get back into operation (at the minimum service level).
    5 hours is the RECOVERY TIME OBJECTIVE.

  • 7. RPO: RECOVERY POINT OBJECTIVE
    This describes the acceptable amount of data loss, measured in time.

  • 8. RPO: WHAT DOES IT IMPLY?
    Take a system that records 1000 transactions per hour.
    A snapshot of the system is taken at 03:00 am (every day).
    At 10:00 am a disaster event occurs.
    In this case we lose around 7000 transactions.
    03:00 to 04:00: 1000 transactions
    04:00 to 05:00: 1000 transactions
    But: we are accepting 24 hours of data loss, i.e. 24000 transactions (RPO).

  • 9. DISASTER RECOVERY STRATEGIES
    Local tape backup
    Online backup
    Pilot light
    Warm stand-by
    And more
    [Figure: scale labels for cost ($, $$$, $$$$$$) and recovery time (seconds, days)]

  • 10. ON-PREMISE & CLOUD
    Use cloud resources in order to provide business continuity.

  • 11. Disaster Recovery & Cloud?
    On demand: we can allocate and release new resources whenever we need to.
    Cost effective: pay-as-you-go model. We pay only for the resources that we are actually using.
    Scalable: we can scale freely and adapt our strategy thanks to autoscaling and other mechanisms.
    Secure: control doesn't mean security.

  • 12. FOCUS ON DATABASES
    We will focus on MySQL, but you can apply the same ideas to your own infrastructure without any problem.

  • 13. BACKUP & RESTORE
    Take a snapshot of a system and restore it when you need it.

  • 14. Application

  • 15. Backup

  • 16. Restore

  • 17. RTO & RPO?
    Things to remember

  • 18. RTO
    Which resources can impact my RTO?

  • 19. RESOURCES ALLOCATION
    How fast can we set up all the resources, e.g. instances, network, etc.?

  • 20. DB RESTORE
    How long can the database restore take?

  • 21. RPO
    Which resources can impact my RPO?

  • 22. DB SNAPSHOT
    How much time do we need to recover all the data from our snapshot?

  • 23. Backup & Restore: RPO & RTO
    Configuration
      Resources allocation: ???
      Restore operation: ???
      DNS TTL: 30 minutes
      Snapshot: every 24 hours
    Effects
      RTO (Recovery Time Objective): 30 minutes + ??? + ???
      RPO (Recovery Point Objective): 24 hours
    Downtime per month
      99.8% availability: 86.23 minutes
      99.95% availability: 21.56 minutes

  • 24. COSTS ON S3 (AWS)
    $0.085 per GB, durability 99.999999999%
    $0.068 per GB, durability 99.99%
    $0.010 per GB, durability 99.999999999% [Glacier]

  • 25. Pilot light
    We keep a small resource always active, which helps us bring up the whole system when needed.

  • 26. Replication
    Basically, pilot light is based on database replication strategies.
    For MySQL, async replication is used as the base strategy.
    http://www.slideshare.net/corleycloud/mysql-scale-out-cloudparty-2013-milano-talent-garden

  • 27. ON-PREMISE WEB APP

  • 28. READ REPLICA ON A CLOUD PROVIDER

  • 29. MOVE TO CLOUD ON A DISASTER

  • 30. RTO & RPO?
    Things to remember

  • 31. RTO
    Which resources can impact my RTO?

  • 32. RESOURCES ALLOCATION
    Running and configuring new instances typically takes a couple of minutes.
    You always have to keep an eye on resources and timings.

  • 33. DNS PROPAGATION
    DNS takes a little while to propagate new addresses (Time To Live).

  • 34. RPO
    Which resources can impact my RPO?

  • 35. DB REPLICATION
    Remember that master/slave replication is ASYNC!
    This implies a replication LAG, and that lag impacts your RPO!

  • 36. MONITOR YOUR INFRASTRUCTURE
    Setting an RPO of about 20 minutes implies that your replication LAG must always stay under 20 minutes!

  • 37. Pilot Light: RPO & RTO
    Configuration
      Resources allocation: 20 minutes
      DNS TTL: 30 minutes
      Replication LAG: 20 minutes
    Effects
      RTO (Recovery Time Objective): 50 minutes
      RPO (Recovery Point Objective): 20 minutes
    Downtime per month
      99.8% availability: 86.23 minutes
      99.95% availability: 21.56 minutes

  • 38. COSTS ON AWS
    $0.06 per hour: 1 m1.small, ~$43 per month
    $0.05 per GB (EBS)
    $0.05 per 1 million I/O requests (EBS)

  • 39. WARM STANDBY
    Extends pilot-light resource allocation and preparation.

  • 40. Warm Standby

  • 41. Warm Stand-by

  • 42. Warm Stand-by: RPO & RTO
    Configuration
      Resources allocation: 5 minutes
      DNS TTL: 30 minutes
      Replication LAG: 20 minutes
    Effects
      RTO (Recovery Time Objective): 35 minutes
      RPO (Recovery Point Objective): 20 minutes
    Downtime per month
      99.8% availability: 86.23 minutes
      99.95% availability: 21.56 minutes

  • 43. COSTS ON AWS
    $0.06 per hour: 2 m1.small, ~$86 per month
    $0.05 per GB (EBS)
    $0.05 per 1 million I/O requests (EBS)
    ELB: $20 per month

  • 44. PILOT LIGHT VS WARM STAND-BY
    In our examples, pilot light actually looks much more effective than warm stand-by. Doesn't it?

  • 45. DEPENDS ON ASSUMPTIONS
    We assume that we don't need to scale out our database and that scaling it up is enough!
    Resource allocation for new read replicas? How long does it take?

  • 46. THANKS FOR LISTENING
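
The RTO, RPO and downtime figures quoted on slides 6, 8, 37 and 42 are simple arithmetic, so a small sketch may help when plugging in your own numbers. This is only an illustration and not part of the original talk: the names `Strategy`, `transactions_lost` and `downtime_budget_min` are hypothetical, the month is assumed to be 30 days long (which gives values close to, but not exactly, the downtime figures on the slides), and comparing the RTO against the monthly downtime budget is one possible reading of why the slides list them together.

```python
from dataclasses import dataclass

# Assumption: a 30-day month. The slides do not say which month length they use.
MINUTES_PER_MONTH = 30 * 24 * 60


@dataclass
class Strategy:
    """Hypothetical container for the 'Configuration' values on the slides."""
    name: str
    allocation_min: float        # time to allocate and configure resources
    dns_ttl_min: float           # DNS TTL before clients see the new address
    data_loss_window_min: float  # snapshot interval or replication lag

    @property
    def rto_min(self) -> float:
        # RTO as summed on slides 37 and 42: allocation time + DNS TTL.
        return self.allocation_min + self.dns_ttl_min

    @property
    def rpo_min(self) -> float:
        # RPO is the window of data we accept to lose.
        return self.data_loss_window_min


def transactions_lost(rate_per_hour: float, hours_since_last_copy: float) -> float:
    """Transactions recorded since the last snapshot (slide 8)."""
    return rate_per_hour * hours_since_last_copy


def downtime_budget_min(availability_pct: float) -> float:
    """Minutes of downtime per month allowed by an availability target."""
    return MINUTES_PER_MONTH * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Slide 8: snapshot at 03:00, disaster at 10:00, 1000 transactions/hour.
    print("lost at 10:00:", transactions_lost(1000, 10 - 3))
    print("worst case (24 h RPO):", transactions_lost(1000, 24))

    strategies = [
        Strategy("pilot light", allocation_min=20, dns_ttl_min=30,
                 data_loss_window_min=20),
        Strategy("warm stand-by", allocation_min=5, dns_ttl_min=30,
                 data_loss_window_min=20),
    ]
    for s in strategies:
        for target in (99.8, 99.95):
            budget = downtime_budget_min(target)
            fits = s.rto_min <= budget
            print(f"{s.name}: RTO {s.rto_min:.0f} min, RPO {s.rpo_min:.0f} min, "
                  f"{target}% budget {budget:.1f} min, single outage fits: {fits}")
```

For the Backup & Restore scenario the slides leave the allocation and restore times as "???", so they are deliberately not modelled here. You would need to measure them yourself before a meaningful RTO can be computed.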