MS CLOUD DB - AZURE SQL DB Fault Tolerance by Subha Vasudevan and Christina Burnett

Windows AZURE Cloud Services

AZURE Storage Services

● Blob
● Table
● Queue
● File Storage

Azure SQL Database

Database as a Service
● Predictable performance
● Scalability
● Business continuity
● Data protection
● Zero administration

Azure DB

Fault Tolerance and Failure

Why is it so important?
● Supports concurrency control
● Provides transactional guarantees
● ACID (atomicity, consistency, isolation, durability)

Why does it fail?
● Inevitable software/hardware failure
● Human errors

Fault Tolerant SQL Database

● Redundant computers rather than redundant components.

● Fault tolerance at the highest level of the stack: a fault-tolerant DB rather than fault-tolerant DB servers.

● Database replication across fault zones.

● Failure Detection and Failover.

Fault Zones/Domains

Each fault zone is a fully independent physical sub-system with its own server racks and network routers.

Assigning Storage to a Fault Domain

Proximity vs. Isolation
● Proximity of replicas affects network latency
● Isolation helps ensure availability of replicas in the event of a failure

Selection of replica location
● MDS codes
● (N, K) coding

(Banerjee, Das, Mazumder, Derakhshandeh, & Sen, 2014)
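As a rough illustration of (N, K) coding: with a maximum distance separable (MDS) code, data is split into K fragments and encoded into N, and any K of the N fragments are enough to rebuild it. The storage cost is N/K times the data size, and up to N - K fragments may be lost. The sketch below uses the simplest such code, a (3, 2) single-parity XOR code; it is not the scheme from the cited paper, and the function names `encode` and `decode` are hypothetical.

```python
from typing import List, Optional, Tuple


def encode(a: bytes, b: bytes) -> List[bytes]:
    """Encode two equal-length data fragments into three: a, b, and a XOR b."""
    if len(a) != len(b):
        raise ValueError("fragments must have equal length")
    parity = bytes(x ^ y for x, y in zip(a, b))
    return [a, b, parity]


def decode(fragments: List[Optional[bytes]]) -> Tuple[bytes, bytes]:
    """Recover (a, b) from any two of the three fragments; None marks a lost one."""
    if len(fragments) != 3:
        raise ValueError("expected exactly three fragments")
    if sum(f is None for f in fragments) > 1:
        raise ValueError("a (3, 2) code tolerates only one lost fragment")
    a, b, parity = fragments
    if a is None:
        # a = b XOR parity
        a = bytes(x ^ y for x, y in zip(b, parity))
    elif b is None:
        # b = a XOR parity
        b = bytes(x ^ y for x, y in zip(a, parity))
    return a, b


fragments = encode(b"AZ", b"DB")
# Lose the first fragment (for example, its fault zone is down) and rebuild it.
print(decode([None, fragments[1], fragments[2]]))
```

Here the three fragments would each go to a different fault zone, at a storage cost of 1.5 times the data size instead of 3 times for three full copies.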

Database Replication

There are three copies of each DB: a primary and two secondary replicas. The primary database performs the transactions and sends the updates and DDL to the replicas.

Database Replication

Each replica is stored in a different fault zone.
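A minimal sketch of the placement rule, assuming a hypothetical `place_replicas` helper and a least-loaded-first policy (the slides do not say how zones are chosen, only that each replica lands in a different one):

```python
from typing import Dict, List


def place_replicas(zone_load: Dict[str, int], copies: int = 3) -> List[str]:
    """Pick `copies` distinct fault zones, least-loaded first.

    The first zone returned hosts the primary, the rest host secondaries.
    Ties are broken by zone name so the result is deterministic.
    """
    if copies > len(zone_load):
        raise ValueError("not enough fault zones for one replica per zone")
    ranked = sorted(zone_load, key=lambda zone: (zone_load[zone], zone))
    return ranked[:copies]


# Hypothetical zone names mapped to the number of databases they already host.
print(place_replicas({"fz1": 5, "fz2": 1, "fz3": 3, "fz4": 2}))
```

Because the zones are taken from the keys of a dictionary, no zone can be chosen twice, so a single zone failure removes at most one copy.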

Quorum-Based Commit

● At least two copies required.

● Data must be written to the primary and at least one secondary before it is considered committed.
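The rule above can be written as a small check. The sketch assumes the two-of-three rule is a majority quorum, which is a generalization the slides do not state; `quorum_size` and `is_committed` are hypothetical names.

```python
def quorum_size(replicas: int) -> int:
    """Smallest number of copies that forms a majority of `replicas`."""
    return replicas // 2 + 1


def is_committed(primary_written: bool, secondary_acks: int, replicas: int = 3) -> bool:
    """A write commits once the primary plus acknowledging secondaries reach quorum."""
    if not primary_written:
        return False
    return 1 + secondary_acks >= quorum_size(replicas)


print(quorum_size(3))          # copies needed with three replicas
print(is_committed(True, 1))   # primary plus one secondary
print(is_committed(True, 0))   # primary alone
```

With three replicas the quorum is two, so a write survives the loss of any single copy.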

Dynamic Quorum

PRIMARY FAILS
When the server containing the primary database fails, one of the secondary replicas is promoted to primary.

Dynamic Quorum

SECONDARY FAILS
When a server that contains secondary replicas fails, new replicas are created.

Transactional Consistency

● Updates are persisted in log

● Primary DB streams updates to secondaries

● Secondaries are asked to commit first

● Secondaries return acknowledgement

● Primary commits after quorum
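The five steps above can be sketched as a toy in-memory model. This is an illustration under simplifying assumptions (synchronous calls, no networking, no rollback), not the actual Azure SQL Database implementation; the `Primary` and `Secondary` classes are hypothetical.

```python
from typing import List


class Secondary:
    def __init__(self, name: str, healthy: bool = True) -> None:
        self.name = name
        self.healthy = healthy
        self.log: List[str] = []

    def apply(self, update: str) -> bool:
        """Commit the update locally and return an acknowledgement."""
        if not self.healthy:
            return False  # a failed replica sends no acknowledgement
        self.log.append(update)
        return True


class Primary:
    def __init__(self, secondaries: List[Secondary]) -> None:
        self.secondaries = secondaries
        self.log: List[str] = []
        self.committed: List[str] = []

    def commit(self, update: str) -> bool:
        # Step 1: persist the update in the primary's log.
        self.log.append(update)
        # Steps 2-4: stream to the secondaries, which commit first and acknowledge.
        acks = sum(s.apply(update) for s in self.secondaries)
        # Step 5: commit on the primary once at least one secondary has acknowledged.
        if acks >= 1:
            self.committed.append(update)
            return True
        return False


s1, s2 = Secondary("s1"), Secondary("s2", healthy=False)
primary = Primary([s1, s2])
print(primary.commit("UPDATE t SET x = 1"))  # one healthy secondary is enough
```

In this model an update that fails to reach quorum stays in the primary's log but is never added to `committed`; what a real system does with such an entry is outside the scope of the sketch.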

Recovering Transactions

If a secondary fails, on restart it checks with the primary for transactions it may have missed.
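One way to picture the catch-up step, assuming each logged update carries an increasing sequence number (an assumption; the slides do not describe how missed transactions are identified). The names `missed_entries` and `catch_up` are hypothetical.

```python
from typing import List, Tuple

LogEntry = Tuple[int, str]  # (sequence number, update)


def missed_entries(primary_log: List[LogEntry], last_applied: int) -> List[LogEntry]:
    """Entries on the primary that are newer than the secondary's last applied one."""
    return [entry for entry in primary_log if entry[0] > last_applied]


def catch_up(secondary_log: List[LogEntry], primary_log: List[LogEntry]) -> List[LogEntry]:
    """On restart, append every entry the secondary missed while it was down."""
    last_applied = secondary_log[-1][0] if secondary_log else 0
    secondary_log.extend(missed_entries(primary_log, last_applied))
    return secondary_log


primary_log = [(1, "insert a"), (2, "insert b"), (3, "insert c")]
secondary_log = [(1, "insert a")]  # failed after applying entry 1
print(catch_up(secondary_log, primary_log))
```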

Failure Detection
● The database is paired with the SQL Engine to detect failures in the neighborhood.
● Distributed failure detection - every node monitored by several neighbors.
● Efficient, localized and fast.
● Prevents ping storms and avoids delayed failure detection
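A sketch of neighbor-based detection, assuming the nodes form a logical ring and that a node is declared failed only when all of its monitors agree. Both the ring layout and the unanimity rule are assumptions made for illustration; `neighbors` and `is_failed` are hypothetical names.

```python
from typing import Dict, List, Set


def neighbors(nodes: List[str], node: str, k: int = 2) -> List[str]:
    """Return the k nodes that follow `node` on the ring; these monitor it."""
    if k >= len(nodes):
        raise ValueError("k must be smaller than the number of nodes")
    i = nodes.index(node)
    return [nodes[(i + d) % len(nodes)] for d in range(1, k + 1)]


def is_failed(nodes: List[str], node: str,
              unreachable_reports: Dict[str, Set[str]], k: int = 2) -> bool:
    """Declare `node` failed only if every one of its monitors reports it unreachable."""
    return all(node in unreachable_reports.get(monitor, set())
               for monitor in neighbors(nodes, node, k))


ring = ["n1", "n2", "n3", "n4"]
# Each monitor lists the nodes it could not reach.
reports = {"n1": {"n4"}, "n2": {"n4"}}
print(neighbors(ring, "n4"))
print(is_failed(ring, "n4", reports))
```

Each node is probed by only k monitors, so probe traffic grows linearly with cluster size rather than quadratically as in all-to-all pinging.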

Failover
● If primary node fails unexpectedly, standby backup node automatically assumes role of primary.
● Managed by GPM (Global Partition Manager).
● Distributed fabric maintains a global map
● GPM maintains the health, state and location of every DB.
● Fabric informs GPM of any node failure.
● GPM reconfigures assignment of primary and secondary DBs in failed node.
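The reconfiguration step can be sketched with a toy partition map. The data layout (each database maps to a list of nodes, primary first) and the `handle_node_failure` function are hypothetical; the real GPM is not described at this level in the slides.

```python
from typing import Dict, List

# Partition map: database name -> nodes holding its replicas, primary first.
PartitionMap = Dict[str, List[str]]


def handle_node_failure(pmap: PartitionMap, failed: str,
                        spare_nodes: List[str]) -> PartitionMap:
    """Return a new map with the failed node removed from every replica set.

    If the failed node held a primary, the first surviving secondary moves to
    the front and becomes primary. A replacement replica is then added on the
    first spare node that does not already hold a copy of that database.
    """
    new_map: PartitionMap = {}
    for db, replicas in pmap.items():
        survivors = [node for node in replicas if node != failed]
        if len(survivors) < len(replicas):
            replacement = next(
                (n for n in spare_nodes if n != failed and n not in survivors),
                None,
            )
            if replacement is not None:
                survivors.append(replacement)
        new_map[db] = survivors
    return new_map


pmap = {"db1": ["n1", "n2", "n3"], "db2": ["n2", "n3", "n4"]}
print(handle_node_failure(pmap, "n1", ["n5"]))
```

In this example `n1` held the primary of `db1`, so `n2` is promoted and `n5` receives a new secondary; `db2` is untouched because it had no replica on `n1`.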

[Figure: a client connects through gateway processes to nodes that each hold primary (p) and secondary (s) replicas]

Fault Tolerance in Application Design

Data Failure
● application specific
● catastrophic consequences
● not addressed by Azure

Computational Failure
● addressed by Azure
● controlled by application

Monitoring and Logging
● diagnosis
● debugging

(Jie Li et al., 2010)

References

Fault-tolerance in Windows Azure SQL Database. [Online]. Available: http://azure.microsoft.com/blog/2012/07/30/fault-tolerance-in-windows-azure-sql-database/

Banerjee, S., Das, A., Mazumder, A., Derakhshandeh, Z., & Sen, A. (2014). On the impact of coding parameters on storage requirement of region-based fault tolerant distributed file system design. Paper presented at the Computing, Networking and Communications (ICNC), 2014 International Conference On, 78-82. doi:10.1109/ICCNC.2014.6785309

Jie Li, Humphrey, M., You-Wei Cheah, Youngryel Ryu, Agarwal, D., Jackson, K., & van Ingen, C. (2010). Fault tolerance and scaling in e-science cloud applications: Observations from the continuing development of MODIS Azure. Paper presented at the E-Science (E-Science), 2010 IEEE Sixth International Conference On, 246-253. doi:10.1109/eScience.2010.47

Rajan, D., Canino, A., Izaguirre, J. A., & Thain, D. (2011). Converting a high performance application to an elastic cloud application. Paper presented at the Cloud Computing Technology and Science (CloudCom), 2011 IEEE Third International Conference On, 383-390. doi:10.1109/CloudCom.2011.58

QUESTIONS?