View
2.048
Download
3
Category
Preview:
DESCRIPTION
The occurrence of failures and crashes can compromise the high availability of your database system, affecting your revenue and reputation. Therefore, it is fundamental to minimize downtime and have an efficient strategy for crash recovery. Replication and failover are commonly applied to deal with those situations, but what if failures occur during the recovery process? This can really be a headache, so it is better to be prepared. This session discusses the development of fault-tolerant failover solutions using the MySQL utilities library and covers the following topics: • Issues during failover/switchover • Fault-tolerant failover solutions • Using the MySQL utilities library to provide your own solution
Citation preview
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.1
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.2
Development of Fault-Tolerant Failover Tools with MySQL Utilities
Dr. Paulo Jesus
Software Developer – MySQL Utilities
Dr. Lars Thalmann
MySQL Development Director
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.3
Program Agenda
Introduction: Faults, Fault-Tolerance and Failover
MySQL Failover
– How to Handle Master Crash, Slave Crash and Connection Failures
Automatically!
Introducing the mysqlfailover utility
– Fault-Tolerance and Taking Advantage of its Features
Final Remarks
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.4
Introduction: Faults, Fault-tolerance and Failover
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.5
Faults
Crash
– Server crash and stop due to a cosmic ray (more likely software/hardware
malfunction)
Message loss
– Communication channel fails
Byzantine
– Data corruption or malicious attacks
Different types of faults can lead to system failures
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.6
Fault-Tolerance
Provide high-availability
Improve reliability
Reduce system downtime and revenue loss
Typically obtained through redundancy
– Example: replication
Fault-tolerance is a property that enables the system to continue
operating properly when faults occur.
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.7
Failover
Fault-tolerant strategy:
– Promote a slave as the new master
– Reduce system downtime
* Switchover is the transfer of the master role in an otherwise healthy topology
and is performed manually. In many cases, the original master can be returned
to the topology as a slave.
Failover can be achieved by automatically switching to another server
(redundant or standby) upon failure. *
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.8
Replication topology:
Context: MySQL ReplicationFaults, Fault-Tolerance and Failover
Master: A
Slaves: B, C and D
Asynchronous
replication
Atrx1
trx2
trx3
B
C
Dtrx1 trx1
trx2
trx3
trx1
trx2
trx2, trx3
trx3
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.9
Replication topology:
Context: MySQL ReplicationFaults, Fault-Tolerance and Failover
Master: A
Slaves: B, C and D
Atrx1
trx2
trx3
B
C
Dtrx1 trx1
trx2
trx3
trx1
trx2
trx2, trx3
trx3
Here comes the cosmic ray!!!
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.10
Not critical...
Context: MySQL ReplicationFaults, Fault-Tolerance and Failover
Partial impact:
– Reduce reads
performance
Replace the slave
Atrx1
trx2
trx3
B
C
Dtrx1 trx1
trx2
trx3
trx1
trx2
trx2, trx3
trx3
Slave crashed
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.11
Replication topology:
Context: MySQL ReplicationFaults, Fault-Tolerance and Failover
Master: A
Slaves: B, C and D
Atrx1
trx2
trx3
B
C
Dtrx1 trx1
trx2
trx3
trx1
trx2
trx2, trx3
trx3
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.12
Replication topology:
Context: MySQL ReplicationFaults, Fault-Tolerance and Failover
Master: A
Slaves: B, C and D
Atrx1
trx2
trx3
B
C
Dtrx1 trx1
trx2
trx3
trx1
trx2
trx2, trx3
trx3
Connection failure
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.13
Not critical…
Context: MySQL ReplicationFaults, Fault-Tolerance and Failover
Partial impact:
– Outdated data
read from slave
Restore connection
Atrx1
trx2
trx3
B
C
Dtrx1 trx1
trx2
trx3
trx1
trx2
trx2, trx3
trx3
Master-Slave Connection lost
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.14
Replication topology:
Context: MySQL ReplicationFaults, Fault-Tolerance and Failover
Master: A
Slaves: B, C and D
Atrx1
trx2
trx3
B
C
Dtrx1 trx1
trx2
trx3
trx1
trx2
trx2, trx3
trx3
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.15
Replication topology:
Context: MySQL ReplicationFaults, Fault-Tolerance and Failover
Master: A
Slaves: B, C and D
Atrx1
trx2
trx3
B
C
Dtrx1 trx1
trx2
trx3
trx1
trx2
trx2, trx3
trx3
Here comes another cosmic ray!!!
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.16
Now, it gets serious!
Context: MySQL ReplicationFaults, Fault-Tolerance and Failover
Replication stops
– Reads only
– No more writes
Failover is needed
Atrx1
trx2
trx3
B
C
Dtrx1 trx1
trx2
trx3
trx1
trx2
trx2, trx3
trx3
Master crashed
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.17
Failover process:
Context: MySQL ReplicationFaults, Fault-Tolerance and Failover
Promote the
“best”* slave as
master
Connect remaining
slaves to the new
master
Atrx1
trx2
trx3
B
C
D
trx1
trx1
trx2
trx3
trx1
trx2
trx2, trx3
trx3
* Here it is the most up-to-date (for simplicity).
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.18
Why ?
Context: MySQL ReplicationFailover Is Not So Easy…
New master might
not be the most up-
to-date *
– But must be at the
end of failover
Ensure data
consistency
Atrx1
trx2
trx3
B
C
D
trx1
trx1
trx2
trx3
trx1
trx2
* The “best” candidate might be the one with the best hardware.
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.19
MySQL Failover
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.20
Relay log details:
Relay LogsMySQL Failover
Slave B: trx2
Slave C: <none>
Slave D: trx2, trx3
Slave B and C did
not receive trx3
Atrx1
trx2
trx3
B
C
Dtrx1
trx1
trx1
trx2
trx3
trx3trx2
trx2
trx3
binlog
relay
relay
relay
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.21
Master crashed!
Relay LogsMySQL Failover
Slave D is the most
up-do-date
– SHOW SLAVE STATUS
– Master log file and
position or
retrieved GTIDs
Atrx1
trx2
trx3
B
C
Dtrx1
trx1
trx1
trx2
trx3
trx3trx2
trx2
trx3
binlog
relay
relay
relay
Most up-to-date:
Check slave
status
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.22
Failover:
Relay LogsMySQL Failover
Promote D as the
new master
– STOP SLAVE,
CHANGE MASTER
TO*, START SLAVE
Atrx1
trx2
trx3
B
C
D
trx1
trx1
trx1
trx2
trx2
trx2
trx3
binlog
relay
relay
relay
Hum... Something is not right...
binlog
* CHANGE MASTER TO deletes all relay log files and starts a
new one, unless relay_log_file or relay_log_pos are specified.
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.23
(1) What can happen:
Relay LogsMySQL Failover
“D” consume trxs on
relay log
Atrx1
trx2
trx3
B
C
D
trx1
trx1
trx2
trx3
trx1
trx2
trx2
binlog
relay
relay
relay binlog
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.24
(1) What can happen:
Relay LogsMySQL Failover
“D” consume trxs on
relay log
Slaves read trxs
Atrx1
trx2
trx3
B
C
D
trx1
trx1
trx2
trx3
trx1
trx2
trx2, trx3
trx2
trx3
trx2
trx3
binlog
relay
relay
relay binlog
trx2, trx3
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.25
(1) What can happen:
Relay LogsMySQL Failover
“D” consume trxs on
relay log
Slaves read trxs
Slaves apply trxs
SQL thread stops on “C”
Atrx1
trx2
trx3
B
C
D
trx1
trx2
trx3
trx1
trx2
trx3
trx1
trx2
trx2
trx3
binlog
relay
relay
relay
Conflicting transactions!
binlog
SQL thread error!
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.26
(2) What can happen:
Relay LogsMySQL Failover
Delete relay log on
the new master
– RESET SLAVE
Atrx1
trx2
trx3
B
C
D
trx1
trx1
trx1
trx2
trx2
trx3
binlog
relay
relay
relay binlog
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.27
(2) What can happen:
Relay LogsMySQL Failover
Delete relay log on
the new master
– RESET SLAVE
trx2 only on slave B and
trx3 is lost.
Atrx1
trx2
trx3
B
C
D
trx1
trx1
trx1
trx2
trx2
trx3
binlog
relay
relay
relay binlog
trx3 is lost!
Transactions lost!
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.28
Solution:
Relay LogsMySQL Failover
Consume all
transactions on the
relay logs before
promoting the new
master
Atrx1
trx2
trx3
B
C
D
trx1
trx2
trx1
trx2
trx3
trx1
trx2
binlog
relay
relay
relay binlog
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.29
New master is not the most up-to-date:
New Master SynchronizationMySQL Failover
“Best” candidate is
the one with better:
– Hardware
– Location
It is not the most up-
to-date
Atrx1
trx2
trx3
B
C
Dtrx1
trx1
trx1
trx2
trx3
trx3trx2
trx2
trx3
binlog
relay
relay
relay
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.30
Failover:
New Master SynchronizationMySQL Failover
Slave B is the “best”
candidate
Consume relay logs
Promote “B” to
master
Atrx1
trx2
trx3B
C
D
trx1
trx2
trx1
trx2
trx3
trx1
trx2
binlog
relay
relay
relay
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.31
Failover:
New Master SynchronizationMySQL Failover
Slave B is the “best”
candidate
Consume relay logs
Promote “B” to
master
Atrx1
trx2
trx3B
C
D
trx1
trx2
trx1
trx2
trx3
trx1
trx2
binlog
relay
relay
relay
Data consistency issue!
trx3 is only on “D”!
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.32
Solution:
New Master SynchronizationMySQL Failover
Synchronize candidate
before promoting it:
Get and apply
differential changes
from the most up-to-
date slave or loop
throw all slaves
Atrx1
trx2
trx3B
C
D
trx1
trx2
trx3
trx1
trx2
trx3
trx1
trx2
binlog
relay
relay
relay
trx3
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.33
Solution:
New Master SynchronizationMySQL Failover
New master must
become one of the
most up-to-date
before being
promoted
Atrx1
trx2
trx3B
C
D
trx1
trx2
trx3
trx1
trx2
trx3
trx1
trx2
binlog
relay
relay
relay
Note: If candidate is too delayed it might take too much
time to synchronize... Better to choose another one?
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.34
Failover With or Without GTIDs
Without GTIDs:
– Handle files and positions
Headache to determine file and position to synchronize candidates
Headache to determine file and position for CHANGE MASTER TO
With GTIDs (from MySQL 5.6.5):
– Handle GTID sets
Simple sets manipulation to determine missing transactions
Automatic: CHANGE MASTER TO ... MASTER_AUTO_POSITION = 1
No headache to determine new master’s file and position
MySQL Failover
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.35
Failover With or Without GTIDs
Use of GTIDs simplifies the failover task:
– Removes complexity of handling files and positions
– Opens the door to support more complex multi-tier replication topologies
Still a difficult task to achieve (manually):
– Detecting master failure
– Requires profound knowledge of MySQL replication
– Is time consuming and error-prone
Wouldn’t it be great if we had a tool to do it automatically?
MySQL Failover
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.36
Introducing the mysqlfailover Utility
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.37
Introducing the mysqlfailover Utility
Features
– Configurable timeouts to detect master failures (interval and ping)
– Configurable failover modes (auto, elect, fail)
– Optimized failover algorithm
http://svenmysql.blogspot.pt/2013/03/flexible-fail-over-policies-using-mysql.html
– Automatic slave discovery
– Logging to file
Utility to report the health of a replication topology (master and its slaves)
and perform automatic failover.
New
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.38
Introducing the mysqlfailover Utility
Features
– Two operation modes:
Console (default)
Daemon (only on POSIX platforms)
– Support for new authentication mechanism (login-path)
– Detect execution of multiple instances (for the same master)
– Detect errant transactions
– Pedantic execution mode
– Extension points (allow execution of external scripts)
New
New
New
New
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.39
Introducing the mysqlfailover Utility
Requirements
– All servers with version >= MySQL 5.6.9 and GTID_MODE=ON:
--report-host, --report-port, --log-slave-updated, --enforce-gtid-consistency, and
--master-info-repository=TABLE
More information
– http://dev.mysql.com/doc/workbench/en/mysqlfailover.html
Example of use
mysqlfailover --master=m1 --slaves=s2,s1,s3 --daemon=start --log=failover.txt
Note: m1, s1, s2 and s3 are login-path in the mylogin.cnf file used for security and
usability purposes, avoiding the use of plain text connection strings, such as
rpluser:mypass@master_host:3306.
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.40
mysqlfailover in action:
Introducing the mysqlfailover Utility
Master fault detection
– Are you alive?
Get health data from
master and slaves
Atrx1
trx2
trx3
B
C
Dtrx1 trx1
trx2
trx3
trx1
trx2
trx2, trx3
trx3
Simple, open source, no impact on system
performance!
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.41
Introducing the mysqlfailover Utility
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.42
Fault-Tolerance of mysqlfailover and Taking Advantage of its Features
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.43
Fault-Tolerance of mysqlfailover
How faults affect mysqlfailover?
– Server crash
– Connection failure
How to improve fault-tolerance?
– Tips and tricks
– Extension points
Errant transactions
– Good practice to avoid them
What will be discussed
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.44
Server crash:
Fault-Tolerance of mysqlfailover
Utility stopped
Replication topology
not affected
Easy to start a new
instance of the utility
– Use --force option,
overwrite instance
registration
Atrx1
trx2
trx3
B
C
Dtrx1 trx1
trx2
trx3
trx1
trx2
trx2, trx3
trx3
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.45
Connection failure (1):
Fault-Tolerance of mysqlfailover
Fault detection
– Suspect master
has failed
Failover
– Slave promoted to
new master
– Old master still
alive, abandoned
Atrx1
trx2
trx3
B
C
D
trx1
trx1
trx2
trx3
trx1
trx2
trx2, trx3
trx3
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.46
Connection failure (2):
Fault-Tolerance of mysqlfailover
Fault detection
– Suspect master
has failed
Failover
– Split replication
topology
Atrx1
trx2
trx3
B
C
Dtrx1
trx2
trx1
trx2
trx3
trx1
trx2
trx2, trx3
trx3
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.47
Improve fault-tolerance:
Fault-Tolerance of mysqlfailover
Run mysqlfailover
close to the master,
but not on it
Carefully set “interval”
and “ping” values
Monitor mysqlfailover
Use extension points
Atrx1
trx2
trx3
B
C
Dtrx1 trx1
trx2
trx3
trx1
trx2
trx2, trx3
trx3
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.48
Extension Points of mysqlfailover
--exec-fail-check: script executed on each interval to replace default
failure detection
– E.g., customize failure detector according to replication topology (test
master connection through different network paths)
--exec-before: script to execute before starting failover
– E.g., shutdown master’s system (to ensure only one master will be online)
Allow the execution of external scripts to override the behavior of the
utility at specific points using the following options:
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.49
Extension Points of mysqlfailover
--exec-after: script to execute at the end of failover
– E.g., change network setting to direct client writes to the new master
--exec-post-failover: script to execute after completing the failover
process (successfully or not)
– E.g., send a notification to the administrator with the failover result
Access to variables with master’s information from scripts
– $1: old master host, $2: old master port, $3: new master host, $4: new
master port
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.50
Slave transaction that is not on all slaves
Errant Transactions
Examples: b-1, c-1
– Different GTID
– Can be the same
SQL command
CREATE TABLE
table1
– Each one only on
one slave
Aa-1
a-2
a-3
B
C
Da-1
b-1
a-1
a-2
a-3
a-1
a-2
c-1
trx2, trx3
trx3
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.51
Can lead to errors during/after failover:
Errant Transactions
Both b-1 and c-1
– CREATE TABLE table1
New master B cannot
apply c-1, and slave C
cannot apply b-1: Table
“table1” already exists.
Aa-1
a-2
a-3
B
C
D
a-1
b-1
a-2
a-3
a-1
a-2
a-3
a-1
a-2
c-1SQL Error!
b-1, a-3
b-1
Conflicting transactions!
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.52
Errant Transactions
mysqlfailover identifies errant transactions
– Error when starting the utility
– Warning while executing unless the --pedantic option is used
Can only be fixed manually
– Commit empty transaction on slaves with the GTID of the errant one
SET GTID_NEXT=‘…’; BEGIN; COMMIT; SET GTID_NEXT='AUTOMATIC‘;
Good practice
– Disable binary log (SET sql_log_bin = 0) when executing transactions
locally on slaves. So that they are not replicated!
How to handle this problem? Avoid it!mysqlfailover
is clever!
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.53
Final Remarks
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.54
Final Remarks
Faults can greatly impact your system (replication topology)
– System down
– Revenue lost
– Negative reputation
Failover
– Fault-tolerant strategy for high availability
– Works in most situations
Fault detectors make mistakes, failover process can fail, etc.
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.55
Final Remarks
Simple and easy to use
Advanced features
– Perform several configuration checks (to prevent failures)
– Optimized failover algorithm
– Configurable monitoring timeouts and operation modes
– Extension points
Building block for more sophisticated and fault-tolerant tools
– Taking advantage of extension points or MySQL Utilities Library (Open
Source)
The mysqlfailover utility
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.56
By The Way…
There are many other useful MySQL Utilities
mysqldbcompare
mysqldbcopy
mysqldbexport
mysqldbimport
mysqldiff
Database Ops mysqlauditadmin
mysqlauditgrep
Specific – Audit log
mysqlserverclone
mysqlserverinfo
Server Ops
mysqluc
Usability
msqldiskusage
mysqlindexcheck
mysqlmetagrep
mysqlprocgrep
mysqluserclone
mysqlfrm
Generic/Support
New
New
New
New
mysqlfailover
mysqlreplicate
mysqlrpladmin
mysqlrplcheck
mysqlrplshow
High Availability
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.57
Do You Want More?
MySQL Utilities (1.3.5 GA)
– Download: http://dev.mysql.com/downloads/tools/utilities/
Launchpad: https://launchpad.net/mysql-utilities
– Docs: http://dev.mysql.com/doc/workbench/en/mysql-utilities.html .
Contributing Ideas:
– Community users can use: http://bugs.mysql.com (MySQL Workbench: Utilities).
– Oracle customers can use: bug.oraclecorp.com (Product = MySQL Workbench,
Component = WBUTILS).
Send us an e-mail (please use sparingly):
– Charles A. Bell, PhD (Team Lead): chuck.bell@oracle.com
– Paulo Jesus, PhD (Developer): paulo.jesus@oracle.com
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.58
Questions ?Paulo Jesuspaulo.jesus@oracle.com
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.59
Graphic Section Divider
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.60
Replication topology:
Another Fault Scenario
Master: A
Slaves: B, C and D
Asynchronous
replication
Atrx1
trx2
trx3
B
C
Dtrx1 trx1
trx2
trx1
trx2
trx2, trx3
trx3
trx3
trx3 did not reach any slave (yet)
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.61
Master crashed:
Another Fault Scenario
trx3 didn’t reach
any slave
Atrx1
trx2
trx3
B
C
Dtrx1 trx1
trx2
trx1
trx2
trx2, trx3
trx3
trx3
Is failover enough to solve this?
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.62
What can we do?
Another Fault Scenario
With failover
– Try to get missing
transaction from
master (if possible)
Semi-synchronous
replication helps
Atrx1
trx2
trx3
B
C
D
trx1
trx1
trx2
trx1
trx2
trx2
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.63
Recommended