Percona XtraDB Cluster Tutorial
http://www.percona.com/training/
© 2011 - 2017 Percona, Inc. 1 / 102
Table of Contents
Table of Contents
1. Overview & Setup
2. Migrating from M/S
3. Galera Replication Internals
4. Online Schema Changes
5. Application HA
6. Monitoring Galera
7. Incremental State Transfer
8. Node Failure Scenarios
9. Avoiding SST
10. Tuning Replication
XtraDB Cluster Tutorial: OVERVIEW
Resources
http://www.percona.com/doc/percona-xtradb-cluster/5.7/
http://www.percona.com/blog/category/percona-xtradb-cluster/
http://galeracluster.com/documentation-webpages/reference.html
Standard Replication
Replication is a mechanism for recording a series of changes on one MySQL server and applying the same changes to one or more other MySQL servers.
The source is the "master" and its replica is a "slave."
Galera Replication
In Galera, the dataset is synchronized across all servers in the cluster.
You can write to any node; every node stays in sync.
Our Current Setup
Single master node; two slave nodes.
node1 will also run sysbench.
Sample application, simulating activity
(*) Your node IP addresses will probably be different.
Verify Connectivity and Setup
Connect to all 3 nodes.
Use separate terminal windows; we will be switching between them frequently. The password is vagrant.

ssh -p 22201 vagrant@localhost
ssh -p 22202 vagrant@localhost
ssh -p 22203 vagrant@localhost

Verify MySQL is running.
Verify replication is running:
SHOW SLAVE STATUS\G
Run Sample Application
Start the application and verify:

node1# run_sysbench_oltp.sh

Is replication flowing to all the nodes?
Observe the sysbench transaction rate.
Checking Consistency
Run a pt-table-checksum from node1:
node1# pt-table-checksum -d test
Check for differences
node3> SELECT db, tbl, SUM(this_cnt) AS numrows, COUNT(*) AS chunks
       FROM percona.checksums
       WHERE master_crc <> this_crc GROUP BY db, tbl;
See all the results
node3> SELECT * FROM percona.checksums;
Are there any differences between node1 and its slaves? If so, why?
Monitoring Replication Lag
Start a heartbeat on node1:
node1# pt-heartbeat --update --database percona \ --create-table --daemonize
Monitor the heartbeat on node2
node2# pt-heartbeat --monitor --database percona \ --master-server-id=1
Stop the heartbeat on node1 and see how that affects the monitor.
STOP SLAVE on node2 for a while and see how that affects the monitor.
XtraDB Cluster Tutorial: MIGRATING FROM MASTER/SLAVE
What is Our Plan?
We have an application working on node1.
We want to migrate our master/slave setup to a PXC cluster with minimal downtime.
We will build a cluster from one slave and then add the other servers to it one at a time.
Upgrade node3

node3# systemctl stop mysql
node3# yum swap -- remove Percona-Server-shared-57 Percona-Server-server-57 \ -- install Percona-XtraDB-Cluster-shared-57 Percona-XtraDB-Cluster-server-57
node3# systemctl start mysql
node3# mysql_upgrade
node3# mysql -e "show slave status\G"
node3# systemctl stop mysql -- For the next steps
The Percona XtraDB Cluster server and client packages are drop-in replacements for Percona Server and even community MySQL.
Configure node3 for PXC

[mysqld]
# Leave existing settings and add these
binlog_format = ROW

# galera settings
wsrep_provider = /usr/lib64/libgalera_smm.so
wsrep_cluster_name = mycluster
wsrep_cluster_address = gcomm://node1,node2,node3
wsrep_node_name = node3
wsrep_node_address = 192.168.70.4
wsrep_sst_auth = sst:secret

innodb_autoinc_lock_mode = 2
innodb_locks_unsafe_for_binlog = ON
Start node3 with:

systemctl start mysql@bootstrap
Checking Cluster State
Can you tell:
How many nodes are in the cluster?
Is the cluster Primary?
Are cluster replication events being generated?
Is replication even running?
Replicated Events
Cannot async replicate to node3!
What would the result be if we added node2 to the cluster?
What can be done to be sure that writes arriving at node3 via replication are replicated to the cluster?
(*) Hint: If you’ve ever configured slave-of-slave in MySQL replication, you know the fix!
Fix Configuration
Make this change:

[mysqld]
...
log-slave-updates
...
Restart MySQL
# systemctl restart mysql@bootstrap
Why did we have to restart with bootstrap?
Why did we have to restart with bootstrap?
Because this is still a cluster of one node.
Checking Cluster State Again
Use 'myq_status' to check the state of Galera on node3:
node3# /usr/local/bin/myq_status wsrep
Upgrade node2
We only need a single node replicating from our production master.

node2> STOP SLAVE; RESET SLAVE ALL;

Upgrade/swap packages as before:

node2# systemctl stop mysql
node2# yum swap -- remove Percona-Server-shared-57 Percona-Server-server-57 \ -- install Percona-XtraDB-Cluster-shared-57 Percona-XtraDB-Cluster-server-57
Don't start mysql yet!
Edit node2's my.cnf as you did for node3.
What needs to be different from node3's config?
node2's Config

[mysqld]
# Leave existing settings and add these
binlog_format = ROW
log-slave-updates

# galera settings
wsrep_provider = /usr/lib64/libgalera_smm.so
wsrep_cluster_name = mycluster
wsrep_cluster_address = gcomm://node1,node2,node3
wsrep_node_name = node2           # !!! CHANGE !!!
wsrep_node_address = 192.168.70.3 # !!! CHANGE !!!
wsrep_sst_auth = sst:secret

innodb_autoinc_lock_mode = 2
innodb_locks_unsafe_for_binlog = ON

Don't start MySQL yet!
What's going to happen?
Don't start MySQL yet!
When node2 is started, what's going to happen?
How does node2 get a copy of the dataset?
Can it use the existing data it already has?
Do we have to bootstrap node2?
State Snapshot Transfer (SST)
Transfer a full backup of an existing cluster member (donor) to a new node entering the cluster (joiner).
We configured our SST method to use 'xtrabackup-v2'.
Additional Xtrabackup SST Setup
What about that sst user in your my.cnf?

[mysqld]
...
wsrep_sst_auth = sst:secret
...

User/grants need to be added for Xtrabackup:

node3> CREATE USER 'sst'@'localhost';
node3> ALTER USER 'sst'@'localhost' IDENTIFIED BY 'secret';
node3> GRANT PROCESS, RELOAD, LOCK TABLES, REPLICATION CLIENT
    ->   ON *.* TO 'sst'@'localhost';
The good news is that once you get things figured out the first time, getting an SST on subsequent nodes is typically very easy.
Up and Running node2
Now you can start node2:

node2# systemctl start mysql

Watch node2's error log for progress:

node2# tail -f /var/lib/mysql/error.log
Taking Stock of our Achievements
We have a two-node cluster.
Our production master replicates, asynchronously, into the cluster via node3.
What might we do now in a real migration before proceeding?
Finishing the Migration
How should we finish the migration?
How do we minimize downtime?
What do we need to do to ensure there are no data inconsistencies?
How might we roll back?
Our Migration Steps
Shut down the application pointing to node1.
Shut down (and RESET) replication on node3 from node1:

node3> STOP SLAVE; RESET SLAVE ALL;

Start the application pointing to node3:
Edit run_sysbench_oltp.sh and change the IP.
./run_sysbench_oltp.sh to restart it.

Rebuild node1 as we did for node2:

node1# systemctl stop mysql
node1# yum swap -- remove Percona-Server-shared-57 Percona-Server-server-57 \
  -- install Percona-XtraDB-Cluster-shared-57 Percona-XtraDB-Cluster-server-57
node1's Config

[mysqld]
# Leave existing settings and add these
binlog_format = ROW
log-slave-updates

# galera settings
wsrep_provider = /usr/lib64/libgalera_smm.so
wsrep_cluster_name = mycluster
wsrep_cluster_address = gcomm://node1,node2,node3
wsrep_node_name = node1           # !!! CHANGE !!!
wsrep_node_address = 192.168.70.2 # !!! CHANGE !!!
wsrep_sst_auth = sst:secret

innodb_autoinc_lock_mode = 2
innodb_locks_unsafe_for_binlog = ON

node1# systemctl start mysql
Where are We Now?
XtraDB Cluster Tutorial: GALERA REPLICATION INTERNALS
Replication Roles
Within the cluster, all nodes are equal.
master/donor node: the source of the transaction.
slave/joiner node: the receiver of the transaction.
Writeset: a transaction; one or more row changes.
State Transfer Roles
New nodes joining an existing cluster get provisioned automatically.
Joiner = new node
Donor = node giving a copy of the data
State Snapshot Transfer: a full backup of the donor to the joiner.
Incremental State Transfer: only the changes since the node left the cluster.
What's inside a write-set?
The payload: generated from the RBR event.
Keys: generated by the source node.
  Primary keys
  Unique keys
  Foreign keys
Keys are how certification happens.
A PRIMARY KEY on every table is required.
Replication Steps
As master/donor:
  Apply
  Replicate
  Certify
  Commit
As slave/joiner:
  Replicate (receive)
  Certify (includes reply to master/donor)
  Apply
  Commit
What is Certification?
In a nutshell: "Can this transaction be applied?"
It happens on every node, for all write-sets.
It is deterministic.
Results of certification are not relayed back to the master/donor.
In the event of failure, drop the transaction locally; possibly remove self from the cluster.
On success, the write-set enters the apply queue.
Serialized via the group communication protocol.
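The certification rule above can be sketched as a key-overlap check: a write-set fails if any of its keys overlaps a write-set that was globally ordered after the snapshot it executed against. This is a simplified toy model for intuition, not Galera's actual implementation; all names are illustrative.

```python
# Toy model of Galera-style certification: a write-set fails certification
# if its keys overlap those of a write-set that committed inside its
# "conflict window" (after its snapshot, before it in total order).

class WriteSet:
    def __init__(self, seqno, depends_on, keys):
        self.seqno = seqno            # position in the global total order
        self.depends_on = depends_on  # last seqno applied when the trx executed
        self.keys = set(keys)         # primary/unique/foreign key identifiers

def certify(ws, certified):
    """Return True if ws passes certification against prior write-sets."""
    for other in certified:
        # In the conflict window and sharing a key -> certification failure
        if ws.depends_on < other.seqno < ws.seqno and ws.keys & other.keys:
            return False  # transaction is dropped (deterministic on every node)
    certified.append(ws)
    return True

certified = []
assert certify(WriteSet(1, 0, {"t1:pk=1"}), certified)
assert certify(WriteSet(2, 0, {"t1:pk=2"}), certified)      # disjoint keys: OK
assert not certify(WriteSet(3, 0, {"t1:pk=1"}), certified)  # overlaps seqno 1
```

Because every node runs the same deterministic test over the same totally ordered stream, no reply to the originating node is needed.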
(*) https://www.percona.com/blog/2016/04/17/how-percona-xtradb-cluster-certification-works/
The Apply
Apply is done locally, asynchronously, on each node, after certification.
It can be parallelized like standard async replication (more on this later).
If there is a conflict, it will create brute force aborts (bfa) on the local node.
Last but not least, the Commit Stage
A local InnoDB commit on each node.
An applier thread executes it on slave/joiner nodes.
A regular connection thread executes it on the master/donor node.
Galera Replication
Galera Replication Facts
"Replication" is synchronous; slave apply is not.
A successful commit reply to the client means the transaction WILL get applied across the entire cluster.
A node that tries to commit a transaction conflicting with one already replicated will get a deadlock error.
This is called a local certification failure, or 'lcf'.
When our transaction is applied, any competing transactions that haven't committed will get a deadlock error.
This is called a brute force abort, or 'bfa'.
Unexpected Deadlocks
Create a test table:

node1 mysql> CREATE TABLE test.deadlocks (i INT UNSIGNED NOT NULL
          -> PRIMARY KEY, j VARCHAR(32));
node1 mysql> INSERT INTO test.deadlocks VALUES (1, NULL);
node1 mysql> BEGIN; -- Start explicit transaction
node1 mysql> UPDATE test.deadlocks SET j = 'node1' WHERE i = 1;

Before we commit on node1, go to node3 in a separate window:

node3 mysql> BEGIN; -- Start transaction
node3 mysql> UPDATE test.deadlocks SET j = 'node3' WHERE i = 1;
node3 mysql> COMMIT;
node1 mysql> COMMIT;
node1 mysql> SELECT * FROM test.deadlocks;
Unexpected Deadlocks (cont.)
Which commit succeeds?
Will only a COMMIT generate the deadlock error?
Is this an lcf or a bfa?
How would you diagnose this error?
Things to watch during this exercise:
  myq_status (especially on node1)
  SHOW ENGINE INNODB STATUS
Application Hotspots
Start sysbench on all the nodes at the same time:
$ /usr/local/bin/run_sysbench_update_index.sh
Watch myq_status 'Conflct' columns for activity
$ myq_status wsrep
Edit the script and adjust the following until you see some conflicts:

--oltp-table-size smaller (250)
--tx-rate higher (75)
When could you multi-node write?
When your application can handle the deadlocks gracefully.
  wsrep_retry_autocommit can help.
When the ratio of successful commits to deadlocks is not too small.
AUTO_INCREMENT Handling
Pick one node to work on:

node2 mysql> CREATE TABLE test.autoinc (
          ->   i INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
          ->   j VARCHAR(32));
node2 mysql> INSERT INTO test.autoinc (j) VALUES ('node2');
node2 mysql> INSERT INTO test.autoinc (j) VALUES ('node2');
node2 mysql> INSERT INTO test.autoinc (j) VALUES ('node2');

Now select all the data in the table:

node2 mysql> SELECT * FROM test.autoinc;

What is odd about the values for 'i'?
What happens if you do the inserts on each node in order?
Check wsrep_auto_increment_control and the auto_increment% global variables.
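The gaps come from wsrep_auto_increment_control: each node gets auto_increment_increment equal to the cluster size and a distinct auto_increment_offset, so concurrently generated ids never collide. A minimal sketch of the arithmetic (illustrative; the offsets shown are an assumption about which slot each node holds):

```python
# How wsrep_auto_increment_control spaces out AUTO_INCREMENT values:
# auto_increment_increment = cluster size, auto_increment_offset = this
# node's slot (1..cluster_size), so each node generates a disjoint stride.

def next_ids(offset, increment, count):
    """The ids a node would hand out: offset, offset+increment, ..."""
    return [offset + i * increment for i in range(count)]

cluster_size = 3
# Assume node2 holds offset 2 in a 3-node cluster
print(next_ids(2, cluster_size, 3))  # [2, 5, 8] -- the "odd" gaps in 'i'
```

Inserting on each node in turn interleaves the strides, which is why the values arrive out of order but without conflicts.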
The Effects of Large Transactions
Sysbench on node1 as usual (be sure to edit the script and reset the connection to localhost):
node1# run_sysbench_oltp.sh
Insert 100K rows into a new table on node2
node2 mysql> CREATE TABLE test.large LIKE test.sbtest1;node2 mysql> INSERT INTO test.large SELECT * FROM test.sbtest1;
Pay close attention to how the application (sysbench) is affected by the large transaction.
Serial and Parallel Replication
Every node replicates, certifies, and commits EVERY transaction in the SAME order. This is serialization.
Replication allows interspersing of smaller transactions simultaneously. Big transactions tend to force serialization on this.
The parallel slave threads can APPLY transactions with no interdependencies in parallel on SLAVE nodes (i.e., nodes where the trx did not originate).
Infinite wsrep_slave_threads != infinite scalability.
XtraDB Cluster Tutorial: ONLINE SCHEMA CHANGES
How do we ALTER TABLE?
Directly - ALTER TABLE
  "Total Order Isolation" - TOI (default)
  "Rolling Schema Upgrades" - RSU
pt-online-schema-change
Direct ALTER TABLE (TOI)
Ensure sysbench is running against node1 as usual.
Try these ALTER statements on the nodes listed and note the effect on sysbench throughput:

node1 mysql> ALTER TABLE test.sbtest1 ADD COLUMN `m` varchar(32);
node2 mysql> ALTER TABLE test.sbtest1 ADD COLUMN `n` varchar(32);
Create a different table not being touched by sysbench
node2 mysql> CREATE TABLE test.foo LIKE test.sbtest1;
node2 mysql> INSERT INTO test.foo SELECT * FROM test.sbtest1;
node2 mysql> ALTER TABLE test.foo ADD COLUMN `o` varchar(32);
Does running the command from another node make a difference?
What difference is there altering a table that sysbench is not using?
Direct ALTER TABLE (RSU)
Change to RSU:

node2 mysql> SET wsrep_osu_method = 'RSU';
node2 mysql> ALTER TABLE test.foo ADD COLUMN `p` varchar(32);
node2 mysql> ALTER TABLE test.sbtest1 ADD COLUMN `q` varchar(32);
Watch myq_status and the error log on node2:
node2 mysql> ALTER TABLE test.sbtest1 DROP COLUMN `c`;
What are the effects on sysbench after adding the columns p and q?
What happens after dropping 'c'? Why?
How do you recover?
When is RSU safe and unsafe to use?
pt-online-schema-change
Be sure all your nodes are set back to TOI:
node* mysql> SET GLOBAL wsrep_osu_method = 'TOI';
Let's add one final column:
node2# pt-online-schema-change --alter \ "ADD COLUMN z varchar(32)" D=test,t=sbtest1 --execute
Was pt-online-schema-change any better at doing DDL?
Was the application blocked at all?
What are the caveats to using pt-osc?
XtraDB Cluster Tutorial: APPLICATION HIGH AVAILABILITY
ProxySQL Architecture
Setting up HAProxy
Install HAProxy - just on node1.
Create /etc/haproxy/haproxy.cfg:
  Copy from /root/haproxy.cfg
Restart haproxy.
Test connections to the proxy ports:
node1# mysql -u test -ptest -h 192.168.70.2 -P 4306 \
  -e "show variables like 'wsrep_node_name'"
node1# mysql -u test -ptest -h 192.168.70.2 -P 5306 \
  -e "show variables like 'wsrep_node_name'"
Test shutting down nodes.
Watch http://127.0.0.1:9991
HAProxy Maintenance Mode
Put a node into maintenance mode:

node1# echo "disable server cluster-reads/node3" | socat \
  /var/lib/haproxy/haproxy.sock stdio

Refresh the HAProxy status page.
Test connections to port 5306.
node3 is still online, but HAProxy won't route connections to it.
Enable the node
node1# echo "enable server cluster-reads/node3" | socat \/var/lib/haproxy/haproxy.sock stdio
Refresh status page and test connections
keepalived Architecture
keepalived for a simple VIP
Install keepalived (all nodes).
Create /etc/keepalived/keepalived.conf:
  Copy from /root/keepalived.conf
systemctl start keepalived
Check which host has the VIP:

ip addr list | grep 192.168.70.100

Experiment with restarting nodes while checking which host answers the VIP:

while true; do
  mysql -u test -ptest -h 192.168.70.100 --connect_timeout=5 \
    -e "show variables like 'wsrep_node_name'"
  sleep 1
done
Application HA Essentials
Nodes are properly health checked.
HA is most commonly handled at layer 4 (TCP).
  Therefore, expect node failures to mean app reconnects.
The cluster will prevent apps from doing anything bad.
  But the cluster may not necessarily force apps to reconnect by itself.
Watch out for Donor nodes.
XtraDB Cluster Tutorial: MONITORING GALERA
At a Low Level

mysql> SHOW GLOBAL VARIABLES LIKE 'wsrep%';
mysql> SHOW GLOBAL STATUS LIKE 'wsrep%';

Refer to:
http://www.codership.com/wiki/doku.php?id=mysql_options_0.8
http://www.codership.com/wiki/doku.php?id=galera_status_0.8
http://www.percona.com/doc/percona-xtradb-cluster/wsrep-status-index.html
http://www.percona.com/doc/percona-xtradb-cluster/wsrep-system-index.html
Example Nagios Checks

node1# yum install percona-nagios-plugins -y

-- Verify we're in a Primary cluster
node1# /usr/lib64/nagios/plugins/pmp-check-mysql-status \
  -x wsrep_cluster_status -C == -T str -c non-Primary
-- Verify the node is Syncednode1# /usr/lib64/nagios/plugins/pmp-check-mysql-status \ -x wsrep_local_state_comment -C '!=' -T str -w Synced
-- Verify the size of the cluster is normalnode1# /usr/lib64/nagios/plugins/pmp-check-mysql-status \ -x wsrep_cluster_size -C '<=' -w 2 -c 1
-- Watch for flow controlnode1# /usr/lib64/nagios/plugins/pmp-check-mysql-status \ -x wsrep_flow_control_paused -w 0.1 -c 0.9
Alerting Best Practices
Alerting should be actionable, not noise.
Any other ideas of types of checks to do?
http://www.percona.com/blog/2013/10/31/percona-xtradb-cluster-galera-with-percona-monitoring-plugins
Percona Monitoring and Management
(*) https://pmmdemo.percona.com/
ClusterControl - SeveralNines
XtraDB Cluster Tutorial: INCREMENTAL STATE TRANSFER
Incremental State Transfer (IST)
Helps to avoid a full SST.
When a node starts, it knows the UUID of the cluster to which it belongs and the last sequence number it applied.
Upon connecting to other cluster nodes, it broadcasts this information.
If another node has the missing write-sets cached, IST is triggered.
If no other node has them, SST is triggered.
Firewall alert: IST uses port 4568 by default.
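The decision above can be sketched as a simple predicate: IST is possible only when the joiner shares the cluster's UUID and some donor's gcache still reaches back to the joiner's next needed seqno. A toy model for intuition, not Galera's actual code:

```python
# Illustrative sketch of the IST-vs-SST decision a joiner faces.
# donor_cached_downto models wsrep_local_cached_downto on the donor.

def transfer_method(joiner_uuid, joiner_seqno, donor_uuid, donor_cached_downto):
    if joiner_uuid != donor_uuid:
        return "SST"  # different cluster history: full state transfer
    if donor_cached_downto <= joiner_seqno + 1:
        return "IST"  # gcache still holds everything the joiner is missing
    return "SST"      # the needed write-sets were already purged

assert transfer_method("abc", 100, "abc", 50) == "IST"
assert transfer_method("abc", 100, "abc", 102) == "SST"  # gap in gcache
assert transfer_method("abc", 100, "xyz", 1) == "SST"    # wrong cluster UUID
```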
Galera Cache - "gcache"
Write-sets are contained in the "gcache" on each node.
It is a preallocated file with a specific size; the default size is 128M.
It represents a single-file circular buffer.
It can be increased via a provider option:

wsrep_provider_options = "gcache.size=2G";

The cache is mmapped (I/O buffered to memory).
wsrep_local_cached_downto - the first seqno present in the cache on that node.
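The circular-buffer behavior is what makes wsrep_local_cached_downto move: once the file is full, the oldest write-sets are overwritten. A toy model (illustrative only; the real gcache is a sized mmapped file, not an entry-count ring):

```python
from collections import deque

# Toy model of the gcache as a fixed-capacity ring buffer: when full,
# the oldest write-set is purged and wsrep_local_cached_downto moves up.

class GCache:
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)  # oldest entries fall off the left

    def append(self, seqno):
        self.buf.append(seqno)

    def cached_downto(self):
        """Models wsrep_local_cached_downto: lowest seqno still held."""
        return self.buf[0] if self.buf else None

gc = GCache(capacity=3)
for seqno in range(1, 6):   # write-sets 1..5 into room for only 3
    gc.append(seqno)
print(gc.cached_downto())   # 3 -- write-sets 1 and 2 were purged
```

A bigger gcache.size therefore widens the window during which a restarted node can still rejoin via IST instead of SST.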
Simple IST Test
Make sure your application is running.
Watch myq_status on all nodes.
On another node:
  watch the mysql error log
  stop mysql
  start mysql

How did the cluster react when the node was stopped and restarted?
Was the messaging from the init script and in the log clear?
Is it easy to tell when a node is a Donor for SST vs IST?
XtraDB Cluster Tutorial: NODE FAILURE SCENARIOS
Node Failure Exercises
Run sysbench on node1 as usual; also watch myq_status.
Experiment 1: Clean node restart

node3# systemctl restart mysql
Node Failure Exercises (cont.)
Experiment 2: Partial network outage
Simulate a network outage by using iptables to drop traffic:

node3# iptables -A INPUT -s node1 -j DROP; \
  iptables -A OUTPUT -d node1 -j DROP

After observation, clear the rules:

node3# iptables -F
Node Failure Exercises (cont.)
Experiment 3: node3 stops responding to the cluster
Another network outage, affecting one node:

-- Pay attention to the command continuation backslashes!
node3# iptables -A INPUT -s node1 -j DROP; \
  iptables -A INPUT -s node2 -j DROP; \
  iptables -A OUTPUT -d node1 -j DROP; \
  iptables -A OUTPUT -d node2 -j DROP
Remove iptables rules (after experiments 2 and 3)
node3# iptables -F
A Non-Primary Cluster
Experiment 1: Clean stop of 2/3 nodes (normal sysbench load)

node2# systemctl stop mysql
node3# systemctl stop mysql
A Non-Primary Cluster (cont.)
Experiment 2: Failure of 50% of the cluster (leave node2 down)

node3# systemctl start mysql

-- Wait a few minutes for IST to finish and for the
-- cluster to become stable. Watch the error logs and
-- SHOW GLOBAL STATUS LIKE 'wsrep%'

node3# iptables -A INPUT -s node1 -j DROP; \
  iptables -A OUTPUT -d node1 -j DROP

Force node1 to become Primary like this:

node1 mysql> SET GLOBAL wsrep_provider_options="pc.bootstrap=true";

Clear iptables when done:

node3# iptables -F
Using garbd as a third node
node2's mysql is still stopped!
Install the arbitrator daemon onto node2:

node2# yum install Percona-XtraDB-Cluster-garbd-3.x86_64
node2# garbd -a gcomm://node1:4567,node3:4567 -g mycluster

Experiment 1: Reproduce the node3 failure:

node3# iptables -A INPUT -s node1 -j DROP; \
  iptables -A INPUT -s node2 -j DROP; \
  iptables -A OUTPUT -d node1 -j DROP; \
  iptables -A OUTPUT -d node2 -j DROP
Using garbd as a third node (cont.)
Experiment 2: Partial node3 network failure

node3# iptables -A INPUT -s node1 -j DROP; \
  iptables -A OUTPUT -d node1 -j DROP

Remember to clear iptables when done:

node3# iptables -F
Recovering from a clean all-down state
Stop mysql on all 3 nodes:

$ systemctl stop mysql

Use grastate.dat to find the node with the highest GTID:

$ cat /var/lib/mysql/grastate.dat

Bootstrap that node, then start the others normally:

$ systemctl start mysql@bootstrap   # on the most advanced node
$ systemctl start mysql             # on each of the other nodes

Why do we need to bootstrap the node with the highest GTID?
Why can't the cluster just "figure it out"?
Recovering from a dirty all-down state
kill -9 mysqld on all 3 nodes:

$ killall -9 mysqld_safe; killall -9 mysqld

Check the grastate.dat again. Is it useful?

$ cat /var/lib/mysql/grastate.dat

Try wsrep_recover on each node to get the GTID:

$ mysqld_safe --wsrep_recover

Bootstrap the highest node and then start the others normally.
What would happen if we started the wrong node?
wsrep_recover pulls GTID data from the InnoDB transaction logs. In what cases might this be inaccurate?
XtraDB Cluster Tutorial: AVOIDING SST
Manual State Transfer with Xtrabackup
Take a backup on node3 (sysbench on node1 as usual):
node3# xtrabackup --backup --galera-info --target-dir=/var/backup
Stop node3 and prepare the backup
node3# systemctl stop mysqlnode3# xtrabackup --prepare --target-dir=/var/backup
Manual State Transfer with Xtrabackup (cont.)
Copy back the backup:

node3# rm -rf /var/lib/mysql/*
node3# xtrabackup --copy-back --target-dir=/var/backup
node3# chown -R mysql.mysql /var/lib/mysql

Get the GTID from the backup and update grastate.dat:

node3# cd /var/lib/mysql
node3# mv xtrabackup_galera_info grastate.dat
node3# $EDIT grastate.dat

# GALERA saved state
version: 2.1
uuid: 8797f811-7f73-11e2-0800-8b513b3819c1
seqno: 22809
safe_to_bootstrap: 0

node3# systemctl start mysql
Cases where the state is zeroed
Add a bad config item to the [mysqld] section of /etc/my.cnf:

[mysqld]
fooey
Check what happens to the grastate:
node3# systemctl stop mysql
node3# cat /var/lib/mysql/grastate.dat
node3# systemctl start mysql
node3# cat /var/lib/mysql/grastate.dat
Remove the bad config and use wsrep_recover to avoid SST
node3# mysqld_safe --wsrep_recover
Update the grastate.dat with that position and restart.
What would have happened if you restarted with a zeroed grastate.dat?
When you WANT an SST

node3# cat /var/lib/mysql/grastate.dat

Turn off replication for the session:

node3 mysql> SET wsrep_on = OFF;

Repeat until node3 crashes (sysbench should be running on node1):

node3 mysql> DELETE FROM test.sbtest1 LIMIT 10000;

..after the crash..

node3# cat /var/lib/mysql/grastate.dat
What is in the error log after the crash?
Should we try to avoid SST in this case?
Performing Checksums
Create a user:

node1 mysql> CREATE USER 'checksum'@'%';
node1 mysql> ALTER USER 'checksum'@'%' IDENTIFIED BY 'checksum1';
node1 mysql> GRANT SUPER, PROCESS, SELECT ON *.* TO 'checksum'@'%';
node1 mysql> GRANT CREATE, SELECT, INSERT, UPDATE, DELETE
          ->   ON percona.* TO 'checksum'@'%';

Create a dummy table:

node1 mysql> CREATE TABLE test.fooey (id INT PRIMARY KEY, name VARCHAR(20));
node1 mysql> INSERT INTO test.fooey VALUES (1, 'Matthew');

Verify the row exists on all nodes.
Performing Checksums (cont.)
Run the checksum:

node1# pt-table-checksum -d test --recursion-method cluster \
  -u checksum -p checksum1

TS             ERRORS DIFFS   ROWS CHUNKS SKIPPED  TIME TABLE
04-23T16:42:20      0     0      1      1       0 0.039 test.fooey
04-23T16:37:38      0     0 250000      5       0 1.931 test.sbtest1

Change data without replicating!

node3 mysql> SET wsrep_on = OFF;
node3 mysql> UPDATE test.fooey SET name = 'Charles';

Re-run the checksum command above on node1.
Who has the error?

ALL mysql> SELECT COUNT(*) FROM percona.checksums WHERE this_crc <> master_crc;
Fix Bad Data!
Use pt-table-sync between a good node and the bad node:

node1# pt-table-sync --execute h=node1,D=test,t=fooey h=node3
XtraDB Cluster Tutorial: TUNING REPLICATION
What is Flow Control?
The ability for any node in the cluster to tell the other nodes to pause writes while it catches up.
It is a feedback mechanism for the replication process.
It is ONLY caused by wsrep_local_recv_queue exceeding a node's fc_limit.
This can pause the entire cluster and look like a cluster stall!
Observing Flow Control
Run sysbench on node1 as usual.
Take a read lock on node3 and observe its effect on the cluster:
node3 mysql> SET GLOBAL pxc_strict_mode = 0;node3 mysql> LOCK TABLE test.sbtest1 WRITE;
Make observations. When you are done, release the lock:
node3 mysql> UNLOCK TABLES;
What happens to sysbench throughput?
What are the symptoms of flow control?
How can you detect it?
Tuning Flow Control
Run sysbench on node1 as usual.
Increase the fc_limit on node3:
node3 mysql> SET GLOBAL wsrep_provider_options="gcs.fc_limit=500";
Take a read lock on node3 and observe its effect on the cluster:
node3 mysql> LOCK TABLE test.sbtest1 WRITE;
Make observations. When you are done, release the lock:
node3 mysql> UNLOCK TABLES;
Now, how big does the recv queue on node3 get before FC kicks in?
Flow Control Parameters
gcs.fc_limit
  Pause replication if the recv queue exceeds this many write-sets.
  Default: 100. For master-slave setups this number can be increased considerably.
gcs.fc_master_slave
  When this is NO, the effective gcs.fc_limit is multiplied by sqrt(number of cluster members).
  Default: NO.
  Ex: gcs.fc_limit = 500 * sqrt(3) = 866
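The scaling rule above is simple enough to verify by hand; a minimal sketch of the arithmetic (the function name is illustrative, not a Galera API):

```python
import math

# Effective flow-control threshold: with gcs.fc_master_slave=NO (the
# default), the configured gcs.fc_limit is scaled by sqrt(cluster size).

def effective_fc_limit(fc_limit, cluster_size, master_slave=False):
    if master_slave:
        return fc_limit                          # used as-is
    return int(fc_limit * math.sqrt(cluster_size))

print(effective_fc_limit(500, 3))        # 866, matching the example above
print(effective_fc_limit(500, 3, True))  # 500 with gcs.fc_master_slave=YES
```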
Avoiding Flow Control during Backups
Most backups use FLUSH TABLES WITH READ LOCK.
By default, this causes flow control almost immediately.
We can permit nodes to fall behind:

node3 mysql> SET GLOBAL wsrep_desync=ON;
node3 mysql> FLUSH TABLES WITH READ LOCK;

Do not write to this node!
(No longer the case with 5.7! Yay!)
More Notes on Flow Control
If you set it too low:
  Too many cries for help from random nodes in the cluster.
If you set it too high:
  Increases in replication conflicts with multi-node writes.
  Increases in apply/commit lag on nodes.
That one node with FC issues:
  Deal with that node: bad hardware? not the same specs?
Testing Node Apply Throughput
Using wsrep_desync and LOCK TABLE, we can stop replication on a node for an arbitrary amount of time (until it runs out of disk space).
We can stop a node for a while, release it, and measure the top speed at which it applies.
This gives us an idea of how fast the node is capable of handling the bulk of replication.
Performance here is greatly improved in 5.7 (*)
(*) https://www.percona.com/blog/2017/04/19/how-we-made-percona-xtradb-cluster-scale/
Multiple Applier Threads
Using the wsrep_desync trick, you can measure the throughput of a slave with any setting of wsrep_slave_threads you wish.
wsrep_cert_deps_distance is the maximum "theoretical" number of parallel threads.
Practically, you should probably have far fewer slave threads (500 is a bit much).
Parallel apply has been a source of node inconsistencies in the past.
Wsrep Sync Wait
Apply is asynchronous, so reads after writes on different nodes may see an older view of the database.
Practically, flow control prevents this from becoming like regular async replication.
Tighter consistency can be enforced by ensuring the replication queue is flushed before a read:

SET <session|global> wsrep_sync_wait = [0-7];
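The 0-7 value is a bitmask; the bits are commonly documented as 1 = READ (SELECT, BEGIN), 2 = UPDATE/DELETE, 4 = INSERT/REPLACE. A small decoder for intuition (illustrative helper, not part of any Galera API):

```python
# Decoding the wsrep_sync_wait bitmask: each set bit forces the named
# statement class to wait until the replication queue is applied first.
BITS = {1: "READ", 2: "UPDATE/DELETE", 4: "INSERT/REPLACE"}

def sync_wait_checks(value):
    """Statement classes that will wait for the recv queue to drain."""
    return [name for bit, name in BITS.items() if value & bit]

print(sync_wait_checks(0))  # [] -- no waiting (the default)
print(sync_wait_checks(1))  # ['READ'] -- causal reads only
print(sync_wait_checks(7))  # all three classes wait
```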
XtraDB Cluster Tutorial: CONCLUSION
Common Configuration Parameters

wsrep_sst_auth = un:pw           Set SST authentication
wsrep_osu_method = 'TOI'         Set DDL method
wsrep_desync = ON                Desync node (exempt it from flow control)
wsrep_slave_threads = 4          Number of applier threads
wsrep_retry_autocommit = 4       Number of commit retries
wsrep_provider_options:
  * gcs.fc_limit = 500           Increase flow control threshold
  * pc.bootstrap = true          Put node into single-node Primary mode
  * gcache.size = 2G             Increase Galera cache file
wsrep_sst_donor = node1          Specify donor node
Common Status Counters

wsrep_cluster_status         Primary / Non-Primary
wsrep_local_state_comment    Synced / Donor / Desynced
wsrep_cluster_size           Number of online nodes
wsrep_flow_control_paused    Fraction of time the cluster has been paused by flow control
wsrep_local_recv_queue       Current number of write-sets waiting to be applied
Questions?
(*) Slide content authors: Jay Janssen, Fred Descamps, Matthew Boehm
102 / 102
PXC - Hands-On Tutorial
ist/sst donor selection
● PXC has the concept of a segment (gmcast.segment)
Check if you can find gmcast.segment (by querying SELECT @@wsrep_provider_options)
● Assign the same segment id to cluster nodes that are co-located in the same subnet/LAN.
● Assign a different segment id to cluster nodes that are located in a different subnet/WAN.
SELECT SUBSTRING(@@wsrep_provider_options, LOCATE('gmcast.segment', @@wsrep_provider_options), 18);
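As an aside, the same extraction is easier to read as a regex (illustrative Python; the options string here is a shortened, made-up sample, not real server output):

```python
import re

# A shortened, hypothetical wsrep_provider_options string for illustration.
options = "evs.send_window = 4; gcache.size = 128M; gmcast.segment = 1; gcs.fc_limit = 100"

def segment_of(opts):
    """Pull the gmcast.segment value out of a provider-options string."""
    m = re.search(r"gmcast\.segment\s*=\s*(\d+)", opts)
    return int(m.group(1)) if m else None

print(segment_of(options))  # 1
```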
concept of segment
exercise
(diagram: subnet-1 nodes with gmcast.segment=0; subnet-2 nodes with gmcast.segment=1)
● SST DONOR Selection
○ Search by name (if specified by the user via --wsrep_sst_donor).
■ If the node's state restricts it from acting as DONOR, things get delayed.
Member 0.0 (n3) requested state transfer from 'n2', but it is impossible to select State Transfer donor: Resource temporarily unavailable
○ Search donor by state:
■ Scan all nodes.
■ Check if a node can act as DONOR (avoid DESYNCed nodes, arbitrator).
● YES: Is the node part of the same segment as the JOINER? -> Yes -> Got DONOR
● NO: Keep searching, but cache this remote-segment node too. Worst case, it will be used.
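The search-by-state loop above can be sketched roughly as follows (an illustrative sketch, not Galera's actual code; the node names and fields are invented for the example):

```python
def pick_donor(joiner_segment, nodes):
    """Sketch of donor search-by-state: prefer a usable node in the
    joiner's segment; cache the first usable remote-segment node as
    a worst-case fallback."""
    fallback = None
    for n in nodes:
        # Skip nodes whose state restricts them from donating
        if n["desynced"] or n["arbitrator"]:
            continue
        if n["segment"] == joiner_segment:
            return n["name"]          # same-segment donor found
        if fallback is None:
            fallback = n["name"]      # remember a remote-segment donor
    # May be None: "Resource temporarily unavailable"
    return fallback

nodes = [
    {"name": "garbd", "segment": 0, "desynced": False, "arbitrator": True},
    {"name": "n1",    "segment": 0, "desynced": True,  "arbitrator": False},
    {"name": "n2",    "segment": 1, "desynced": False, "arbitrator": False},
]
print(pick_donor(1, nodes))  # n2 (same segment as the joiner)
print(pick_donor(0, nodes))  # n2 (remote-segment fallback; garbd and n1 are skipped)
```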
sst donor selection
● cluster sees workload
ist donor selection
(diagram: n1, n2, n3; wset-1 and wset-2 entering the cluster)
● workload is processed and replicated on all the nodes
ist donor selection
(diagram: wset-1 and wset-2 replicated to n1, n2, n3)
● n3 needs to leave cluster for sometime
ist donor selection
(diagram: n3 out of the cluster; n1 and n2 hold wset-1, wset-2)
● cluster makes further progress
ist donor selection
(diagram: n1 and n2 add wset-3, wset-4; n3 still only has wset-1, wset-2)
● n1/n2 needs to purge gcache.
ist donor selection
(diagram: n1 and n2 purge parts of their gcache; the retained write-sets now differ between them; n3 still at wset-1, wset-2)
● n3 need to re-join the cluster.
● Which node should it select and Why ?
ist donor selection
(diagram: n3 re-joining; n1 and n2 hold different ranges of write-sets in their gcache)
● n3 will select n2, given that n2 has retained enough write-sets and the probability of it purging further (with respect to n1) is low.
● wsrep_local_cached_downto is the variable to look for. It indicates the lowest sequence number held in the gcache.
● Concept of Safety-Gap.
ist donor selection
● Test?
○ 4 nodes: n1, n2, n3, n4.
○ n1 and n2 are in one segment.
○ n3 and n4 are in another segment.
○ n3 needs to leave the cluster. n3 has trx up to write-set (100).
○ The cluster makes progress and purges too.
○ Latest available write-sets: n1 (200), n2 (250), n4 (200).
○ n3 wants to re-join. Which DONOR will be selected, and why?
ist donor selection
galera gtid
● MySQL GTID vs Galera GTID
○ MySQL GTID is based on server_uuid. This means it is different on all the nodes.
○ Galera GTID is used explicitly for replication and is the same on all the nodes that are part of the cluster.

select @@server_uuid;
select @@global.gtid_executed;
show status like 'wsrep_cluster_state_uuid';

● Global Sequence Number:
○ The incoming queue is like a FIFO.
○ Each write-set is delivered to all the nodes of the cluster in the same order.
● A Galera GTID is assigned only for successful actions.
6a088373-da4b-ee18-6bf3-f0b4e6e19c7d:1-3
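A Galera GTID like the one above is just the cluster state UUID plus a range of global seqnos; a tiny illustrative parser:

```python
def parse_galera_gtid(gtid):
    """Split 'uuid:first-last' (or 'uuid:seqno') into its parts."""
    uuid, _, seqnos = gtid.partition(":")
    first, _, last = seqnos.partition("-")
    return uuid, int(first), int(last or first)

uuid, first, last = parse_galera_gtid("6a088373-da4b-ee18-6bf3-f0b4e6e19c7d:1-3")
print(uuid)         # the cluster state UUID, same on every node
print(first, last)  # 1 3
```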
gtid and global-seqno
exercise
understanding timeout
understanding timeout
N1 N3N2
Group Channel
State-1: 3-node cluster, all healthy. When there is no workload, a keep-alive signal is sent every 1 sec.
N1 N3N2
Group Channel
State-2: N1 has a flaky network connection and can't respond within the configured inactive_check_period. The cluster (N2, N3) marks N1 for the delayed list and adds it once it crosses the delayed margin.
N1 N3N2
Group Channel
State-3: N1's network connection is restored, but to get it out of the delayed list, N2 and N3 need assurance from N1 for delayed_keep_period.
N1 N3N2
Group Channel
State-4: N1 again has a flaky connection and doesn't respond for inactive_timeout (the node is already in the delayed list). This will cause the node to be pronounced DEAD.
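A rough sketch of these decisions (illustrative only; the threshold values here are made up, and the real EVS logic is more involved):

```python
# Simplified sketch of the delayed-list / dead-node decisions from
# States 1-4 above. Times are in seconds; the real logic lives in
# the Galera EVS layer and is driven by the evs.* timeouts.
INACTIVE_CHECK_PERIOD = 0.5   # how often liveness is checked (assumed value)
DELAYED_MARGIN = 1.0          # unresponsive longer than this -> delayed list (assumed)
INACTIVE_TIMEOUT = 15.0       # unresponsive longer than this -> DEAD (assumed)

def classify(seconds_since_last_reply, in_delayed_list):
    if seconds_since_last_reply >= INACTIVE_TIMEOUT:
        return "DEAD"                      # State-4
    if seconds_since_last_reply > DELAYED_MARGIN:
        return "DELAYED"                   # State-2
    # State-3: a delayed node must stay responsive for delayed_keep_period
    # before it is removed from the delayed list (not modeled here).
    return "DELAYED (probation)" if in_delayed_list else "OK"

print(classify(0.3, False))   # OK (State-1: healthy keep-alives)
print(classify(3.0, False))   # DELAYED
print(classify(20.0, True))   # DEAD
```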
geo-distributed cluster
● Nodes of the cluster are distributed across data-centers, yet all nodes are part of one big cluster.
geo-distributed cluster
(diagram: N1, N2, N3 in one data-center; N4, N5, N6 in another, connected over WAN)
● Important aspect to consider:
○ gmcast.segment, which helps describe the topology. All local nodes should have the same gmcast.segment. This
helps in optimizing replication traffic and, of course, SST and IST.
○ Tune timeouts if needed for your WAN setup.
○ Avoid flow control and fragmentation at the Galera level (example settings:
gcs.max_packet_size=1048576; evs.send_window=512; evs.user_send_window=256).
pxc-5.7
● PXC-5.7 GAed during PLAM-2016 (Sep-2016)
○ Since then we have done 3 more releases
Percona XtraDB Cluster 5.7
Sep-16: PXC-5.7.14-26.17
Dec-16: PXC-5.7.16-27.19
Mar-17: PXC-5.7.17-27.20
Apr-17: PXC-5.7.17-29.20
● Cluster-Safe Mode (pxc-strict-mode)
● Extended support for PFS
● Support for encrypted tablespaces
● Improved security
● ProxySQL-compatible PXC
● Tracking IST/Flow-control
● Improved logging and wsrep-stage framework
● Fully Scalable Performance Optimized PXC
● PXC + PMM
● Simplified packaging
Percona XtraDB Cluster 5.7
cluster-safe-mode
to block experimental features
cluster-safe-mode
Why pxc-strict-mode ?
binlog_format=row
Nontransactional SE
table-wo-primary-key
LOCAL LOCKS
LOCAL Ops (ALTER IMPORT/DISCARD)
CTAS
● pxc_strict_mode
● The mode is local to the given node. All nodes should ideally operate in the same mode.
● Not applicable to replicated traffic. Applicable only to user traffic.
● When moving to a stricter mode, make sure the validation checks pass.
cluster-safe-mode
ENFORCING = ERROR
MASTER = ERROR (except locks)
PERMISSIVE = WARNING
DISABLED = 5.6-compatible
● Try to insert into a non-pk table.
mysql> use test;
Database changed

mysql> create table t (i int) engine=innodb;
Query OK, 0 rows affected (0.04 sec)

mysql> insert into t values (1);
ERROR 1105 (HY000): Percona-XtraDB-Cluster prohibits use of DML command on a table (test.t) without an explicit primary key with pxc_strict_mode = ENFORCING or MASTER

mysql> alter table t add primary key pk(i);
Query OK, 0 rows affected (0.06 sec)
Records: 0 Duplicates: 0 Warnings: 0

mysql> insert into t values (1);
Query OK, 1 row affected (0.01 sec)
cluster-safe-mode
exercise
● Try to alter a table between InnoDB and MyISAM
mysql> alter table t engine=myisam;
ERROR 1105 (HY000): Percona-XtraDB-Cluster prohibits changing storage engine of a table (test.t) from transactional to non-transactional with pxc_strict_mode = ENFORCING or MASTER
mysql> drop table t;

mysql> create table myisam (i int) engine=myisam;
Query OK, 0 rows affected (0.01 sec)

mysql> alter table myisam engine=innodb;
Query OK, 0 rows affected (0.04 sec)
Records: 0 Duplicates: 0 Warnings: 0

mysql> show create table myisam;
+--------+-------------------------------------------------------------------------------------------+
| Table  | Create Table                                                                              |
+--------+-------------------------------------------------------------------------------------------+
| myisam | CREATE TABLE `myisam` ( `i` int(11) DEFAULT NULL) ENGINE=InnoDB DEFAULT CHARSET=latin1    |
+--------+-------------------------------------------------------------------------------------------+
mysql> drop table myisam;
cluster-safe-mode
exercise
● Switching to DISABLED and back to ENFORCING
mysql> set global pxc_strict_mode='DISABLED';
Query OK, 0 rows affected (0.00 sec)

mysql> create table myisam (i int) engine=myisam;
Query OK, 0 rows affected (0.01 sec)

mysql> set wsrep_replicate_myisam=1;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into myisam values (10);
Query OK, 1 row affected (0.00 sec)
Check if you see entry (10) on node-2.
What will happen if we now switch
BACK from DISABLED TO ENFORCING ?
cluster-safe-mode
exercise
● Back to ENFORCING
mysql> set global pxc_strict_mode='ENFORCING';
ERROR 1105 (HY000): Can't change pxc_strict_mode to ENFORCING as wsrep_replicate_myisam is ON

mysql> set wsrep_replicate_myisam=0;
Query OK, 0 rows affected (0.00 sec)

mysql> set global pxc_strict_mode='ENFORCING';
Query OK, 0 rows affected (0.00 sec)

node1 mysql> drop table myisam;
Query OK, 0 rows affected (0.00 sec)
Switching will check for the following preconditions:
● wsrep_replicate_myisam = 0
● binlog_format = ROW
● log_output = NONE/FILE
● isolation level != SERIALIZABLE
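The server performs this check internally; a rough sketch of the same logic (illustrative, with invented error strings except the first one, which matches the error shown earlier):

```python
def can_enforce(settings):
    """Return an error string if switching to ENFORCING must fail,
    mirroring the precondition list above; None means it succeeds."""
    if settings.get("wsrep_replicate_myisam"):
        return "Can't change pxc_strict_mode to ENFORCING as wsrep_replicate_myisam is ON"
    if settings.get("binlog_format") != "ROW":
        return "binlog_format must be ROW"
    if settings.get("log_output") not in ("NONE", "FILE"):
        return "log_output must be NONE or FILE"
    if settings.get("tx_isolation") == "SERIALIZABLE":
        return "isolation level must not be SERIALIZABLE"
    return None  # all preconditions pass

cfg = {"wsrep_replicate_myisam": True, "binlog_format": "ROW",
       "log_output": "FILE", "tx_isolation": "REPEATABLE-READ"}
print(can_enforce(cfg))          # the wsrep_replicate_myisam error
cfg["wsrep_replicate_myisam"] = False
print(can_enforce(cfg))          # None -> the switch succeeds
```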
cluster-safe-mode
exercise
● Replicated traffic is not scrutinized (nor are TOI statements or MyISAM DML).
node-1:
mysql> set global pxc_strict_mode='DISABLED';
Query OK, 0 rows affected (0.00 sec)
mysql> create table t (i int) engine=innodb;
Query OK, 0 rows affected (0.06 sec)

mysql> insert into t values (100);
Query OK, 1 row affected (0.01 sec)

mysql> set global pxc_strict_mode='ENFORCING';
Query OK, 0 rows affected (0.00 sec)

node-2:
mysql> select * from test.t;
+------+
| i    |
+------+
|  100 |
+------+
1 row in set (0.00 sec)
cluster-safe-mode
exercise
Remove all tables we created in this experiment (t, myisam). Check that the mode is reset back to ENFORCING.
pxc+pfs
● PFS has the concept of instruments. DB internal objects (mutexes, condition variables, files, threads, etc.) are called instruments.
● Users can decide which instruments to monitor based on need.
● PXC objects to monitor:
○ PXC threads (doesn’t generate event directly)
○ PXC files (doesn’t generate event directly)
○ PXC stages
○ PXC mutexes
○ PXC condition variables
performance-schema
● Applier/Slave Thread(s)
● Rollback/Aborter Thread
● IST Receiver/Sender Thread
● SST Joiner/Donor Thread
● Gcache File Remove Thread
● Receiver Thread
● Service Thread
● GComm Thread
● Write-set Checksum Thread
performance-schema: threads
● Ring Buffer File (GCache)
● GCache Overflow Page File
● RecordSet File
● grastate file/gvwstate file
performance-schema: files
● Any PXC action passes through different stages:
○ Writing/Deleting/Updating Rows
○ Applying write-sets
○ Commit/Rollback
○ Replicating Commit
○ Replaying
○ Preparing for TO Isolation
○ Applier Idle
○ Rollback/Aborted Idle
○ Aborter Active
performance-schema: stages
● Mutex/Cond-Variable/Monitor.
● Mutexes/Cond-Variables needed for coordination at each stage:
○ wsrep_ready, wsrep_sst_init, wsrep_sst, wsrep_rollback,
wsrep_slave_threads, wsrep_desync, wsrep_sst_thread,
wsrep_replaying.
○ certification mutex, stats collection mutex, monitor mutexes (that
control parallel replication), recvbuf mutex, transaction handling mutex,
saved state mutexes, etc.. and corresponding condition-variables.
● A Monitor is like a mutex that enforces ordering. Only a single thread is allowed to
reside in the monitor, and which thread goes next depends on global_seqno.
performance-schema: mutex/cvar
● Mainly to monitor the cluster
○ For performance monitoring and tuning
○ What is taking time ? Certification ? Replication ? Commit ?
○ What is current state of the processing thread (processing + commit wait)
○ Track transaction rollback incident ?
○ …. etc.
○ More to come from performance analysis perspective in future.
why use pfs in pxc ?
● Let’s track some wsrep-stages
● Enable the instruments to track on both nodes of the cluster (node-1, node-2):
use performance_schema;
update setup_instruments set enabled='YES', timed='YES' where name like 'stage%wsrep%';
update setup_consumers set enabled='YES' where name like 'events_stages_%';
● Start sysbench workload
● Let’s start tracking some state and time it takes.mysql> select thread_id, event_name, timer_wait/100000000 as msec from events_stages_history where event_name like '%in replicate
stage%' order by msec desc limit 10;
+-----------+---------------------------------------+--------+
| thread_id | event_name | msec |
+-----------+---------------------------------------+--------+
| 42 | stage/wsrep/wsrep: in replicate stage | 0.0506 |
● Let’s now see the Slave View
performance schema
Remember to disable PFS back to off once done.
exercise
● Let’s track conflict
● Enable the instruments to track on both nodes of the cluster (node-1, node-2):
use performance_schema;
update setup_instruments set enabled='YES', timed='YES' where name like 'wait%wsrep%';
update setup_consumers set enabled='YES' where name like '%wait%';
● Run a conflicting workload (bf_abort)
● Track the events generated.
mysql> select thread_id, event_id, event_name, timer_wait/1000000000000 from events_waits_history_long where event_name like '%COND%rollback%';
+-----------+----------+-----------------------------------------+--------------------------+
| thread_id | event_id | event_name | timer_wait/1000000000000 |
+-----------+----------+-----------------------------------------+--------------------------+
| 7 | 12 | wait/synch/cond/sql/COND_wsrep_rollback | 191.6299 |
| 7 | 18 | wait/synch/cond/sql/COND_wsrep_rollback | 117.6499 |
+-----------+----------+-----------------------------------------+--------------------------+
2 rows in set (0.00 sec)
performance schema
Remember to disable PFS back to off once done.
exercise
pxc + secure
● MySQL-5.7 introduced an encrypted tablespace.
● If your existing cluster node (DONOR) has an encrypted tablespace, then you should be
able to retain the same tablespace in encrypted form after donating it to
the JOINER.
● PXC takes care of moving the keyring (needed for decrypting the tablespace) and then
re-encrypting with the new keyring. (All this is done with the help of XB (Xtrabackup).) All this
is done transparently.
● Since the keyring is transmitted between the 2 nodes, it is important to ensure that the SST
connection is encrypted, and both nodes should have the keyring plugin configured.
support for encrypted tablespace
● PXC internode traffic needs to be secured. There are 2 main types of traffic:
○ SST, during the initial node setup, with the involvement of a 3rd-party tool (XB).
○ IST/replication traffic, internal between PXC nodes.
● PXC-5.7 supports a mode, encrypt=4, that accepts the familiar triplet of ssl-key/ca/cert,
just like MySQL client-server communication. Remember to enable it on all nodes.
[sst]
encrypt=4
ssl-ca=<PATH>/ca.pem
ssl-cert=<PATH>/server-cert.pem
ssl-key=<PATH>/server-key.pem
● MySQL auto-generates these certificates during bootstrap; you can continue to use them, but ensure
all nodes use the same files, or the same ca-file for generating each server-cert.
improved security
● IST/replication traffic already supports the triplet, and here is the way to specify it:
wsrep_provider_options="socket.ssl=yes;socket.ssl_key=/path/to/server-key.pem;socket.ssl_cert=/path/to/server-c
ert.pem;socket.ssl_ca=/path/to/cacert.pem"
improved security
WAIT!!!! I am a first-time user. This sounds a bit tedious.
● pxc-encrypt-cluster-traffic helps secure SST/IST/replication traffic by passing the
default key/ca/cert generated by MySQL.
● Just set this variable to ON (pxc-encrypt-cluster-traffic=ON). It will look for files in the
mysqld section (the ones mysql uses for its own SSL connections) or in the data directory for
auto-generated files. It then auto-passes them for SST/IST secure connections.
OK . that’s simple NOW !!!!
https://www.percona.com/doc/percona-xtradb-cluster/LATEST/howtos/encrypt-traffic.html
improved security
pxc + proxysql
● ProxySQL is one of the leading load balancers.
● PXC is fully integrated with ProxySQL.
○ ProxySQL can monitor PXC nodes
○ Percona has developed a simple single-step solution to auto-configure the proxysql
tracking tables by auto-discovering pxc nodes (proxysql-admin).
○ The same script can also be used to configure different setups:
■ LoadBalancer: Each node is marked for read/write
■ SingleWriter: Only one node services WRITEs. The other nodes can be used for
READs.
pxc + proxysql
● Often it is important to take a node down for some maintenance.
○ Shutdown the server
○ No load on server
● Naturally a user would like to perform these operations directly on the node, but it could be
risky with a LoadBalancer involved, as it may still be sending the node active queries.
● To make this easier, we now allow the user to set this using a single-step process
● pxc_maint_mode (default=DISABLED)
● pxc_maint_transition_period (default=10)
proxysql assisted pxc mode
pxc_maint_mode
proxysql assisted pxc mode
MAINTENANCE
On completion of maintenance user can switch it back to DISABLED.
SHUTDOWN
DISABLED
On shutdown of the node, the node waits for pxc_maint_transition_period before delivering the shutdown signal.
The user switches the mode to MAINTENANCE. The node waits for pxc_maint_transition_period before returning the prompt, but marks the status updated immediately, allowing ProxySQL to update the status.
ProxySQL
Monitors for changes in status and accordingly updates the internal table to avoid sending traffic to the node.
proxysql assisted pxc mode
● Install ProxySQL.
cd /var/repo; yum install proxysql-1.3.5-1.1.x86_64.rpm
which proxysql
proxysql --version
which proxysql-admin
proxysql-admin -v
cat /etc/proxysql-admin.cnf
node1 mysql> CREATE USER 'admin'@'%' IDENTIFIED by 'admin';
node1 mysql> GRANT all privileges on *.* to 'admin'@'%';
node1 mysql> FLUSH privileges;
exercise
proxysql assisted pxc mode
● Ensure proxysql is running. If not, start proxysql in the foreground using "proxysql -f".
mysql -uadmin -padmin -P 6032 -h 127.0.0.1 (.... this will open proxysql terminal)
proxysql-admin -e --write-node="192.168.70.2:3306" (.... the proxysql user doesn't have privileges)

node1 mysql> GRANT all privileges on *.* to 'proxysql_user'@'192.%';
node1 mysql> FLUSH privileges;
● Now check and ensure the said information is in proxysql:
mysql -h 127.0.0.1 -P 6032 -uadmin -padmin (connecting to proxysql)
select * from mysql_users; .... mysql_servers; schedulers;
● Check if proxysql is able to route connections to the server:
mysql --user=proxysql_user --password=passw0rd --host=localhost --port=6033 --protocol=tcp -e "select @@hostname"
exercise
● Check the routing rules through the mysql_servers table of proxysql.
node1 mysql> select * from mysql_servers;
+--------------+--------------+------+--------+---------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname | port | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+--------------+------+--------+---------+-------------+-----------------+---------------------+---------+----------------+---------+
| 10 | 192.168.70.2 | 3306 | ONLINE | 1000000 | 0 | 1000 | 0 | 0 | 0 | WRITE |
| 11 | 192.168.70.3 | 3306 | ONLINE | 1000 | 0 | 1000 | 0 | 0 | 0 | READ |
| 11 | 192.168.70.4 | 3306 | ONLINE | 1000 | 0 | 1000 | 0 | 0 | 0 | READ |
+--------------+--------------+------+--------+---------+-------------+-----------------+---------------------+---------+----------------+---------+
3 rows in set (0.00 sec)
● Start monitoring to check if proxysql is multiplexing the queries among the 2 READ hosts.
watch -n 0.1 "mysql -u proxysql_user -P 6033 -ppassw0rd -h 192.168.70.3 --protocol=tcp -e \"select @@wsrep_node_name\""
proxysql assisted pxc mode
exercise
● Let’s shutdown node-2 and check multiplexing.
Monitor statuswatch -n 0.1 "mysql -uadmin -padmin -P 6032 -h 127.0.0.1 -e \"select status from mysql_servers\""
Monitor pxc_maint_modewatch -n 0.1 "mysql -e \"select @@pxc_maint_mode\""
Shutdown and then restart node-2systemctl stop mysql (check if multiplexing now stabilizes to node3)
systemctl start mysql
proxysql assisted pxc mode
exercise
track more using show status
● Often a node is taken down (server shutdown) for some maintenance, and restarting it can cause
IST. The size of the IST depends on the write-sets processed. It can take noticeable time for the IST activity to
complete (while IST is in progress, the node is not usable).
● Now we can track IST progress through 'wsrep_ist_receive_status':
| wsrep_ist_receive_status | 75% complete, received seqno 1980 of 1610-2103 |
watch -n 0.1 "mysql -e \"show status like 'wsrep_ist_receive_status'\""
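The reported percentage is simply the position within the advertised seqno range; an illustrative parser for the status line shown above:

```python
import re

def ist_progress(status):
    """Parse a wsrep_ist_receive_status line and recompute the percentage
    from the received seqno and the advertised range."""
    m = re.search(r"received seqno (\d+) of (\d+)-(\d+)", status)
    received, first, last = map(int, m.groups())
    return round(100 * (received - first) / (last - first))

print(ist_progress("75% complete, received seqno 1980 of 1610-2103"))  # 75
```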
Let’s see this happening. We will shutdown node-2
do some quick 30 sec OLTP workload on node-1
and then restart node-2 and track IST.
Check what is the wsrep_last_committed post IST.
tracking ist
exercise
● Flow control is a self-regulating mechanism to avoid a case wherein the weakest/slowest node of the cluster
falls behind the main cluster by a significant gap.
● Each replicating node has an incoming/receiving queue.
○ Size of this queue is configurable.
○ Default for now is 100. (Was 16 initially)
○ If there are multiple nodes in the cluster, this size is adjusted based on the number of nodes:
sqrt(number-of-nodes) * configured size.
○ Size is configurable through the gcs.fc_limit variable.
○ Setting it on one node is not replicated to the other nodes.
tracking flow control
● Starting with the new release, this size is exposed through the status variable 'wsrep_flow_control_interval'.
mysql> show status like 'wsrep_flow_control_interval';
+-----------------------------+--------------+
| Variable_name | Value |
+-----------------------------+--------------+
| wsrep_flow_control_interval | [ 141, 141 ] |
+-----------------------------+--------------+
1 row in set (0.01 sec)
● Concept of lower and higher water mark. (gcs.fc_factor)
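A quick sketch of how the interval scales (illustrative; it assumes the sqrt formula above and that gcs.fc_factor sets the lower water mark as a fraction of the upper one, with an assumed default of 1.0):

```python
import math

def fc_interval(fc_limit, n_nodes, fc_factor=1.0):
    """Approximate the [lower, upper] flow-control water marks
    from gcs.fc_limit, the node count, and gcs.fc_factor."""
    upper = int(round(fc_limit * math.sqrt(n_nodes)))
    lower = int(round(upper * fc_factor))
    return [lower, upper]

print(fc_interval(100, 2))   # [141, 141] -- matches the status output above
print(fc_interval(100, 3))   # [173, 173]
```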
Exercise time: We will change the flow-control limit using gcs.fc_limit and see the effect.
tracking flow control
exercise
● Often a node enters flow control depending on the size of the queue and the load on the server.
● It is important to be able to track if the node is in flow control. To date, this was difficult to find out.
● Now we have an easy way out: wsrep_flow_control_status
mysql> show status like 'wsrep_flow_control_status';
+---------------------------+-------+
| Variable_name | Value |
+---------------------------+-------+
| wsrep_flow_control_status | ON |
+---------------------------+-------+
● The variable is boolean and will be turned ON and OFF depending on whether the node is in flow control.
We will force node to pause and see
if the FLOW_CONTROL KICKS IN.
* sysbench based flow-control
tracking flow control
exercise
● Based on the reading so far, it may sound interesting to increase the flow-control limit, since the node will
eventually catch up. So why not increase it to a larger value (say 10K, 25K)?
● It has 3 main drawbacks
(Say you set fc_limit to 10K and you have 8K write-sets replicated on the slave waiting to get applied)
○ If the slave fails, it loses those 8K write-sets, and on restart it will demand those 8K+ write-sets through
IST (even if you immediately restart the node within a few seconds).
○ If the DONOR gcache doesn't have these write-sets, since the gap is large, it will invite an SST, which could
be time-consuming.
○ If you care about fresh data and have wsrep_sync_wait=7, a fired SELECT query will be placed at the 8K+1
slot and will wait for those 8K write-sets to get serviced, increasing SELECT (DML) latency.
effect of increasing flow control
improved logging and stages
● We have improved sst/ist logging. We would love to hear back if you have more comments on this front.
● Major changes:
○ Made logging structure easy to grasp.
○ FATAL errors are now highlighted to spot them easily.
○ In case of a FATAL error, the XB logs are appended to the main node logs.
○ Lots of sanity checks inserted to help catch errors early.
○ wsrep-log-debug=1 ([sst] section) to suppress those informational messages for a normal run.
● Some major causes of errors:
○ Wrong config params (usually credential or port or host)
○ Stale processes from the last run or some other failed attempt left behind.
improved sst/ist logging
2017-04-14T10:08:30.336898Z 1 [Note] WSREP: GCache history reset: old(00000000-0000-0000-0000-000000000000:0) ->
new(461a9eac-20fa-11e7-a31c-4f98008f8d11:0)
20170414 15:38:30.533 WSREP_SST: [INFO] Proceeding with SST.........
20170414 15:38:30.567 WSREP_SST: [INFO] ............Waiting for SST streaming to complete!
2017-04-14T10:08:32.451749Z 0 [Note] WSREP: (4b82fdcc, 'tcp://192.168.1.4:5030') turning message relay requesting off
2017-04-14T10:08:42.283267Z 0 [Note] WSREP: 0.0 (n1): State transfer to 1.0 (n2) complete.
2017-04-14T10:08:42.283423Z 0 [Note] WSREP: Member 0.0 (n1) synced with group.
20170414 15:38:43.075 WSREP_SST: [INFO] Preparing the backup at /opt/projects/codebase/pxc/installed/pxc57/pxc-node/dn2//.sst
20170414 15:38:46.879 WSREP_SST: [INFO] Moving the backup to /opt/projects/codebase/pxc/installed/pxc57/pxc-node/dn2/
20170414 15:38:46.901 WSREP_SST: [INFO] Galera co-ords from recovery: 461a9eac-20fa-11e7-a31c-4f98008f8d11:0
2017-04-14T10:08:46.913474Z 0 [Note] WSREP: SST complete, seqno: 0
improved sst/ist logging
2017-04-14T10:11:09.452740Z 2 [Warning] WSREP: Gap in state sequence. Need state transfer.
2017-04-14T10:11:09.452837Z 0 [Note] WSREP: Initiating SST/IST transfer on JOINER side (wsrep_sst_xtrabackup-v2 --role 'joiner'
--address '192.168.1.4:5000' --datadir '/opt/projects/codebase/pxc/installed/pxc57/pxc-node/dn2/' --defaults-file
'/opt/projects/codebase/pxc/installed/pxc57/pxc-node/n2.cnf' --defaults-group-suffix '' --parent '3433' --binlog 'mysql-bin' )
20170414 15:41:09.523 WSREP_SST: [ERROR] ******************* FATAL ERROR **********************
20170414 15:41:09.525 WSREP_SST: [ERROR] The xtrabackup version is 2.3.5. Needs xtrabackup-2.4.4 or higher to perform SST
20170414 15:41:09.526 WSREP_SST: [ERROR] ******************************************************
2017-04-14T10:11:09.527081Z 0 [ERROR] WSREP: Failed to read 'ready <addr>' from: wsrep_sst_xtrabackup-v2 --role 'joiner' --address
'192.168.1.4:5000' --datadir '/opt/projects/codebase/pxc/installed/pxc57/pxc-node/dn2/' --defaults-file
'/opt/projects/codebase/pxc/installed/pxc57/pxc-node/n2.cnf' --defaults-group-suffix '' --parent '3433' --binlog 'mysql-bin'
Read: '(null)'
2017-04-14T10:11:09.527123Z 0 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address
'192.168.1.4:5000' --datadir '/opt/projects/codebase/pxc/installed/pxc57/pxc-node/dn2/' --defaults-file
'/opt/projects/codebase/pxc/installed/pxc57/pxc-node/n2.cnf' --defaults-group-suffix '' --parent '3433' --binlog 'mysql-bin' : 2 (No
such file or directory)
2017-04-14T10:11:09.527169Z 2 [ERROR] WSREP: Failed to prepare for 'xtrabackup-v2' SST. Unrecoverable.
2017-04-14T10:11:09.527189Z 2 [ERROR] Aborting
improved sst/ist logging
● wsrep-stages are important to track what the given thread is working on at any given time.
● MASTER node:
| 7408 | root | 192.168.1.4:49938 | test | Query | 0 | wsrep: pre-commit/certification passed (15406) | COMMIT | 0 | 0 |
● Replicating/Slave node:
| 1569 | system user | | NULL | Sleep | 0 | wsrep: committed write set (14001) | NULL | 0 | 0 |
Let US RUN some quick workload (Say UPDATE_INDEX)
and track some of the IMPORTANT stages that you
may see OFTEN.
improved wsrep stages
exercise
pxc performance
(benchmark graph slides)
● MAJOR MILESTONE IN PXC HISTORY.
● Performance improved for all kinds of workloads and configurations
○ sync_binlog=0/1
○ log-bin=enabled/disabled
○ OLTP/UPDATE_NON_INDEX/UPDATE_INDEX.
○ Scale even with 1024 threads and 3 nodes.
pxc performance
● So what were the problems, and how did we solve them?
● We found 3 major issues that blocked effective use of:
○ REDO log flush parallelism
○ the Group-Commit protocol (LEADER-FOLLOWER batch operation protocol)
pxc performance
● How commit operates:
○ COMMIT (2 steps)
■ prepare (mark trx as prepared, redo flush (if log-bin=0)). If operating in cluster mode (PXC), replicate the
transaction + enforce ordering.
■ commit (flush binlogs + redo flush + commit innodb)
● PROBLEM-1: The CommitMonitor was exercised such that the REDO flush got serialized.
○ Split replicate + enforce ordering
● PROBLEM-2: The Group-Commit logic is optimized for batch load, but with Commit Ordering in PXC, only 1 transaction
was forwarded to the GC logic.
○ Release Commit Ordering early and rely on the GC logic to enforce ordering. (interim_commit)
● PROBLEM-3: The second fsync/redo-flush (with log-bin=0) was again serialized.
○ Release the CommitMonitor, as order is already enforced + the trx is committed in memory. Only the redo log flush is pending.
pxc performance
pxc + pmm
● PXC is fully integrated with PMM.
● PMM helps monitor different aspects of PXC. Let's see some of the key aspects:
○ How long can a node sustain without need for SST (projected based on historical data).
○ Size of Replicated write-set.
○ Conflict
○ Basic statistical information (as to how many nodes are up and running). If a node goes down,
that is reflected.
○ Is the node in a DESYNCed state or in flow control?
Exercise: pmmdemo.percona.com
monitor pxc through pmm
exercise
misc
● Starting with PXC-5.7, we have simplified the packaging.
○ PXC-5.7 is a single package that you install for all PXC components (mainly the pxc server
+ replication library). garbd remains a separate package.
○ External tools like xtrabackup and proxysql are available from the percona repo.
○ PXC is widely supported on most Ubuntu + Debian + CentOS platforms.
simplified packaging
● PXC is MySQL/PS-5.7.17 compatible, and we normally keep it refreshed with major releases
of the MySQL version (typically once per quarter).
● PXC is compatible with Codership Cluster (5.7 GAed a few months back).
mysql/ps-5.7.17 compatible
tips to run successful cluster
TIP-1: All nodes of the cluster should have the same configuration settings (except
those settings that are h/w dependent or node-specific)
○ For example: wsrep_max_ws_size is used to control large transactions, gcs.fc_limit
(configured through wsrep_provider_options), wsrep_replicate_myisam (pxc-strict-mode
doesn't allow it any more), etc.
○ You should review the default settings and adjust them per your environment, but avoid changing
settings that you are not sure about. For example: wsrep_sync_wait helps read fresh data
but increases SELECT latency. It is recommended to set this only if there is a need for fresh
data.
tips to run successful cluster
TIP-2: Ensure timeouts are configured as per your environment.
○ The default timeouts set in PXC work in most cases (mainly for LAN setups), but each n/w setup is
different, so understand the latency, server configuration, and load the server needs to handle, and
accordingly set the timeouts.
○ PXC loves stable network connectivity, so try to install PXC on servers with stable network
connectivity.
○ GARBD may not need a high-end server, but it still needs good network
connectivity.
tips to run successful cluster
TIP-3: Monitor your PXC cluster regularly.
○ Monitoring is a must for any system's smooth operation, and PXC is not an exception.
○ Plan outages carefully; if you expect a sudden workload increase for a short time, make sure
enough resources are allocated for PXC to handle the increased workload (say,
wsrep_slave_threads).
○ Ensure that nodes don't keep leaving and rejoining the cluster without any known reason.
○ Don't boot multiple nodes to join through SST all of a sudden. One at a time. (A node is offered
membership first, and then the data transfer begins.)
tips to run successful cluster
TIP-4: Understand your workload
○ The workload of each user is different, and it is important to understand what kind of workload your
application is generating.
○ One such user observed that his cluster dragged at some point of the day, and then it took a good
while for the cluster to catch up. The user's application was firing an ALTER command on a large
table while DML was in progress, causing a complete stall of DML commands. On restoration,
the DML workload continued, and slaves failed to catch up due to insufficient resources (read:
slave threads).
○ Avoid use of experimental features (strict-mode will block them anyway).
○ Double-check the semantics while using commands like FTWRL, RSU, Events.
tips to run successful cluster
TIP-5: what to capture if you hit an issue
○ A clear description of what kind of setup you had (3-node cluster, 2 nodes + garbd, master
replicating to cluster, slave, etc.)
○ my.cnf + command-line options + show-status + any other configuration that you know you
have modified. (If they differ between nodes, capture them for all the nodes.)
○ Complete error.log from all of the nodes. (If it involves an SST failure, supporting error logs
are present in the data-dir (or hidden folder .sst).)
○ What the state of the cluster was while it was working fine, and a rough idea of the workload that was active
that pushed the cluster into the current state.
tips to run successful cluster
thanks for attending
● We would love to hear what more you want from PXC, or what problems you face in using PXC, and
would be happy to assist you or take a feature request and put it on the product roadmap.
● connect with us
○ forum: https://www.percona.com/forums/questions-discussions/percona-xtradb-cluster
○ email: [email protected], [email protected]
○ drop in at Percona Booth
○ catch us during conference
connect with us