Percona XtraDB Cluster Tutorial
http://www.percona.com/training/
© 2011 - 2017 Percona, Inc. 1 / 102
Table of Contents
Table of Contents
1. Overview & Setup
2. Migrating from M/S
3. Galera Replication Internals
4. Online Schema Changes
5. Application HA
6. Monitoring Galera
7. Incremental State Transfer
8. Node Failure Scenarios
9. Avoiding SST
10. Tuning Replication
XtraDB Cluster Tutorial: OVERVIEW
Resources
http://www.percona.com/doc/percona-xtradb-cluster/5.7/
http://www.percona.com/blog/category/percona-xtradb-cluster/
http://galeracluster.com/documentation-webpages/reference.html
Standard Replication
Replication is a mechanism for recording a series of changes on one MySQL server and applying the same changes to one or more other MySQL servers.
The source is the "master" and its replica is a "slave."
Galera Replication
In Galera, the dataset is synchronized across all servers in the cluster.
You can write to any node; every node stays in sync.
Our Current Setup
Single master node; two slave nodes.
node1 will also run sysbench.
Sample application, simulating activity
(*) Your node IP addresses will probably be different.
Verify Connectivity and Setup
Connect to all 3 nodes.
Use separate terminal windows; we will be switching between them frequently. The password is vagrant.

ssh -p 22201 vagrant@localhost
ssh -p 22202 vagrant@localhost
ssh -p 22203 vagrant@localhost

Verify MySQL is running.
Verify replication is running:
SHOW SLAVE STATUS\G
Run Sample Application
Start the application and verify:

node1# run_sysbench_oltp.sh

Is replication flowing to all the nodes?
Observe the sysbench transaction rate.
Checking Consistency
Run a pt-table-checksum from node1:
node1# pt-table-checksum -d test
Check for differences
node3> SELECT db, tbl, SUM(this_cnt) AS numrows, COUNT(*) AS chunks
       FROM percona.checksums
       WHERE master_crc <> this_crc GROUP BY db, tbl;
See all the results
node3> SELECT * FROM percona.checksums;
Are there any differences between node1 and its slaves? If so, why?
Monitoring Replication Lag
Start a heartbeat on node1:
node1# pt-heartbeat --update --database percona \ --create-table --daemonize
Monitor the heartbeat on node2
node2# pt-heartbeat --monitor --database percona \ --master-server-id=1
Stop the heartbeat on node1 and see how that affects the monitor.
STOP SLAVE on node2 for a while and see how that affects the monitor.
XtraDB Cluster Tutorial: MIGRATING FROM MASTER/SLAVE
What is Our Plan?
We have an application working on node1.
We want to migrate our master/slave setup to a PXC cluster with minimal downtime.
We will build a cluster from one slave and then add the other servers to it one at a time.
Upgrade node3

node3# systemctl stop mysql
node3# yum swap -- remove Percona-Server-shared-57 Percona-Server-server-57 \ -- install Percona-XtraDB-Cluster-shared-57 Percona-XtraDB-Cluster-server-57
node3# systemctl start mysql
node3# mysql_upgrade
node3# mysql -e "show slave status\G"
node3# systemctl stop mysql -- For the next steps
The Percona XtraDB Cluster server and client packages are drop-in replacements for Percona Server and even community MySQL.
Configure node3 for PXC

[mysqld]
# Leave existing settings and add these
binlog_format = ROW

# galera settings
wsrep_provider = /usr/lib64/libgalera_smm.so
wsrep_cluster_name = mycluster
wsrep_cluster_address = gcomm://node1,node2,node3
wsrep_node_name = node3
wsrep_node_address = 192.168.70.4
wsrep_sst_auth = sst:secret

innodb_autoinc_lock_mode = 2
innodb_locks_unsafe_for_binlog = ON
Start node3 with:

systemctl start mysql@bootstrap
Checking Cluster State
Can you tell:
How many nodes are in the cluster?
Is the cluster Primary?
Are cluster replication events being generated?
Is replication even running?
Replicated Events
Cannot async replicate to node3!
What would the result be if we added node2 to the cluster?
What can be done to be sure that writes arriving at node3 via replication are replicated to the cluster?
(*) Hint: If you’ve ever configured slave-of-slave in MySQL replication, you know the fix!
Fix Configuration
Make this change:

[mysqld]
...
log-slave-updates
...
Restart MySQL
# systemctl restart mysql@bootstrap
Why did we have to restart with bootstrap?
Why did we have to restart with bootstrap?
Because this is still a cluster of one node.
Checking Cluster State Again
Use 'myq_status' to check the state of Galera on node3:
node3# /usr/local/bin/myq_status wsrep
Upgrade node2
We only need a single node replicating from our production master.

node2> STOP SLAVE; RESET SLAVE ALL;

Upgrade/swap packages as before:

node2# systemctl stop mysql
node2# yum swap -- remove Percona-Server-shared-57 Percona-Server-server-57 \ -- install Percona-XtraDB-Cluster-shared-57 Percona-XtraDB-Cluster-server-57
Don't start mysql yet!
Edit node2's my.cnf as you did for node3.
What needs to be different from node3's config?
node2's Config

[mysqld]
# Leave existing settings and add these
binlog_format = ROW
log-slave-updates

# galera settings
wsrep_provider = /usr/lib64/libgalera_smm.so
wsrep_cluster_name = mycluster
wsrep_cluster_address = gcomm://node1,node2,node3
wsrep_node_name = node2           # !!! CHANGE !!!
wsrep_node_address = 192.168.70.3 # !!! CHANGE !!!
wsrep_sst_auth = sst:secret

innodb_autoinc_lock_mode = 2
innodb_locks_unsafe_for_binlog = ON

Don't start MySQL yet!
What's going to happen?
Don't start MySQL yet!
When node2 is started, what's going to happen?
How does node2 get a copy of the dataset?
Can it use the existing data it already has?
Do we have to bootstrap node2?
State Snapshot Transfer (SST)
Transfer a full backup of an existing cluster member (donor) to a new node entering the cluster (joiner).
We configured our SST method to use 'xtrabackup-v2'.
Additional Xtrabackup SST Setup
What about that sst user in your my.cnf?

[mysqld]
...
wsrep_sst_auth = sst:secret
...

User/grants need to be added for Xtrabackup:

node3> CREATE USER 'sst'@'localhost';
node3> ALTER USER 'sst'@'localhost' IDENTIFIED BY 'secret';
node3> GRANT PROCESS, RELOAD, LOCK TABLES, REPLICATION CLIENT
    ->   ON *.* TO 'sst'@'localhost';
The good news is that once you get things figured out the first time, getting an SST on subsequent nodes is typically very easy.
Up and Running node2
Now you can start node2:

node2# systemctl start mysql

Watch node2's error log for progress:

node2# tail -f /var/lib/mysql/error.log
Taking Stock of our Achievements
We have a two-node cluster.
Our production master replicates, asynchronously, into the cluster via node3.
What might we do now in a real migration before proceeding?
Finishing the Migration
How should we finish the migration?
How do we minimize downtime?
What do we need to do to ensure there are no data inconsistencies?
How might we roll back?
Our Migration Steps
Shut down the application pointing to node1.
Shut down (and RESET) replication on node3 from node1:

node3> STOP SLAVE; RESET SLAVE ALL;

Start the application pointing to node3:
Edit run_sysbench_oltp.sh and change the IP.
./run_sysbench_oltp.sh to restart it.

Rebuild node1 as we did for node2:

node1# systemctl stop mysql
node1# yum swap -- remove Percona-Server-shared-57 Percona-Server-server-57 \
  -- install Percona-XtraDB-Cluster-shared-57 Percona-XtraDB-Cluster-server-57
node1's Config

[mysqld]
# Leave existing settings and add these
binlog_format = ROW
log-slave-updates

# galera settings
wsrep_provider = /usr/lib64/libgalera_smm.so
wsrep_cluster_name = mycluster
wsrep_cluster_address = gcomm://node1,node2,node3
wsrep_node_name = node1           # !!! CHANGE !!!
wsrep_node_address = 192.168.70.2 # !!! CHANGE !!!
wsrep_sst_auth = sst:secret

innodb_autoinc_lock_mode = 2
innodb_locks_unsafe_for_binlog = ON

node1# systemctl start mysql
Where are We Now?
XtraDB Cluster Tutorial: GALERA REPLICATION INTERNALS
Replication Roles
Within the cluster, all nodes are equal.
master/donor node: the source of the transaction.
slave/joiner node: the receiver of the transaction.
Writeset: a transaction; one or more row changes.
State Transfer Roles
New nodes joining an existing cluster get provisioned automatically.
Joiner = new node
Donor = node giving a copy of the data
State Snapshot Transfer: a full backup of the donor to the joiner.
Incremental State Transfer: only the changes since the node left the cluster.
What's inside a write-set?
The payload: generated from the RBR event.
Keys: generated by the source node.
  Primary keys
  Unique keys
  Foreign keys
Keys are how certification happens.
A PRIMARY KEY on every table is required.
Replication Steps
As master/donor:
  Apply
  Replicate
  Certify
  Commit
As slave/joiner:
  Replicate (receive)
  Certify (includes reply to master/donor)
  Apply
  Commit
What is Certification?
In a nutshell: "Can this transaction be applied?"
It happens on every node, for all write-sets.
It is deterministic.
Results of certification are not relayed back to the master/donor.
In the event of failure, drop the transaction locally; possibly remove self from the cluster.
On success, the write-set enters the apply queue.
Serialized via the group communication protocol.
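The certification rule above can be sketched as a key-overlap check: a write-set fails if any of its keys overlaps a write-set that was globally ordered after the snapshot it executed against. This is a simplified toy model for intuition, not Galera's actual implementation; all names are illustrative.

```python
# Toy model of Galera-style certification: a write-set fails certification
# if its keys overlap those of a write-set that committed inside its
# "conflict window" (after its snapshot, before it in total order).

class WriteSet:
    def __init__(self, seqno, depends_on, keys):
        self.seqno = seqno            # position in the global total order
        self.depends_on = depends_on  # last seqno applied when the trx executed
        self.keys = set(keys)         # primary/unique/foreign key identifiers

def certify(ws, certified):
    """Return True if ws passes certification against prior write-sets."""
    for other in certified:
        # In the conflict window and sharing a key -> certification failure
        if ws.depends_on < other.seqno < ws.seqno and ws.keys & other.keys:
            return False  # transaction is dropped (deterministic on every node)
    certified.append(ws)
    return True

certified = []
assert certify(WriteSet(1, 0, {"t1:pk=1"}), certified)
assert certify(WriteSet(2, 0, {"t1:pk=2"}), certified)      # disjoint keys: OK
assert not certify(WriteSet(3, 0, {"t1:pk=1"}), certified)  # overlaps seqno 1
```

Because every node runs the same deterministic test over the same totally ordered stream, no reply to the originating node is needed.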
(*) https://www.percona.com/blog/2016/04/17/how-percona-xtradb-cluster-certification-works/
The Apply
Apply is done locally, asynchronously, on each node, after certification.
It can be parallelized like standard async replication (more on this later).
If there is a conflict, it will create brute force aborts (bfa) on the local node.
Last but not least, the Commit Stage
A local InnoDB commit on each node.
An applier thread executes it on slave/joiner nodes.
A regular connection thread executes it on the master/donor node.
Galera Replication
Galera Replication Facts
"Replication" is synchronous; slave apply is not.
A successful commit reply to the client means the transaction WILL get applied across the entire cluster.
A node that tries to commit a transaction conflicting with one already replicated will get a deadlock error.
This is called a local certification failure, or 'lcf'.
When our transaction is applied, any competing transactions that haven't committed will get a deadlock error.
This is called a brute force abort, or 'bfa'.
Unexpected Deadlocks
Create a test table:

node1 mysql> CREATE TABLE test.deadlocks (i INT UNSIGNED NOT NULL
          -> PRIMARY KEY, j VARCHAR(32));
node1 mysql> INSERT INTO test.deadlocks VALUES (1, NULL);
node1 mysql> BEGIN; -- Start explicit transaction
node1 mysql> UPDATE test.deadlocks SET j = 'node1' WHERE i = 1;

Before we commit on node1, go to node3 in a separate window:

node3 mysql> BEGIN; -- Start transaction
node3 mysql> UPDATE test.deadlocks SET j = 'node3' WHERE i = 1;
node3 mysql> COMMIT;
node1 mysql> COMMIT;
node1 mysql> SELECT * FROM test.deadlocks;
Unexpected Deadlocks (cont.)
Which commit succeeds?
Will only a COMMIT generate the deadlock error?
Is this an lcf or a bfa?
How would you diagnose this error?
Things to watch during this exercise:
  myq_status (especially on node1)
  SHOW ENGINE INNODB STATUS
Application Hotspots
Start sysbench on all the nodes at the same time:
$ /usr/local/bin/run_sysbench_update_index.sh
Watch myq_status 'Conflct' columns for activity
$ myq_status wsrep
Edit the script and adjust the following until you see some conflicts:

--oltp-table-size smaller (250)
--tx-rate higher (75)
When could you multi-node write?
When your application can handle the deadlocks gracefully.
  wsrep_retry_autocommit can help.
When the ratio of successful commits to deadlocks is not too small.
AUTO_INCREMENT Handling
Pick one node to work on:

node2 mysql> CREATE TABLE test.autoinc (
          ->   i INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
          ->   j VARCHAR(32));
node2 mysql> INSERT INTO test.autoinc (j) VALUES ('node2');
node2 mysql> INSERT INTO test.autoinc (j) VALUES ('node2');
node2 mysql> INSERT INTO test.autoinc (j) VALUES ('node2');

Now select all the data in the table:

node2 mysql> SELECT * FROM test.autoinc;

What is odd about the values for 'i'?
What happens if you do the inserts on each node in order?
Check wsrep_auto_increment_control and the auto_increment% global variables.
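The gaps come from wsrep_auto_increment_control: each node gets auto_increment_increment equal to the cluster size and a distinct auto_increment_offset, so concurrently generated ids never collide. A minimal sketch of the arithmetic (illustrative; the offsets shown are an assumption about which slot each node holds):

```python
# How wsrep_auto_increment_control spaces out AUTO_INCREMENT values:
# auto_increment_increment = cluster size, auto_increment_offset = this
# node's slot (1..cluster_size), so each node generates a disjoint stride.

def next_ids(offset, increment, count):
    """The ids a node would hand out: offset, offset+increment, ..."""
    return [offset + i * increment for i in range(count)]

cluster_size = 3
# Assume node2 holds offset 2 in a 3-node cluster
print(next_ids(2, cluster_size, 3))  # [2, 5, 8] -- the "odd" gaps in 'i'
```

Inserting on each node in turn interleaves the strides, which is why the values arrive out of order but without conflicts.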
The Effects of Large Transactions
Sysbench on node1 as usual (be sure to edit the script and reset the connection to localhost):
node1# run_sysbench_oltp.sh
Insert 100K rows into a new table on node2
node2 mysql> CREATE TABLE test.large LIKE test.sbtest1;node2 mysql> INSERT INTO test.large SELECT * FROM test.sbtest1;
Pay close attention to how the application (sysbench) is affected by the large transaction.
Serial and Parallel Replication
Every node replicates, certifies, and commits EVERY transaction in the SAME order. This is serialization.
Replication allows interspersing of smaller transactions simultaneously. Big transactions tend to force serialization on this.
The parallel slave threads can APPLY transactions with no interdependencies in parallel on SLAVE nodes (i.e., nodes where the trx did not originate).
Infinite wsrep_slave_threads != infinite scalability.
XtraDB Cluster Tutorial: ONLINE SCHEMA CHANGES
How do we ALTER TABLE?
Directly - ALTER TABLE
  "Total Order Isolation" - TOI (default)
  "Rolling Schema Upgrades" - RSU
pt-online-schema-change
Direct ALTER TABLE (TOI)
Ensure sysbench is running against node1 as usual.
Try these ALTER statements on the nodes listed and note the effect on sysbench throughput:

node1 mysql> ALTER TABLE test.sbtest1 ADD COLUMN `m` varchar(32);
node2 mysql> ALTER TABLE test.sbtest1 ADD COLUMN `n` varchar(32);
Create a different table not being touched by sysbench
node2 mysql> CREATE TABLE test.foo LIKE test.sbtest1;
node2 mysql> INSERT INTO test.foo SELECT * FROM test.sbtest1;
node2 mysql> ALTER TABLE test.foo ADD COLUMN `o` varchar(32);
Does running the command from another node make a difference?
What difference is there altering a table that sysbench is not using?
Direct ALTER TABLE (RSU)
Change to RSU:

node2 mysql> SET wsrep_osu_method = 'RSU';
node2 mysql> ALTER TABLE test.foo ADD COLUMN `p` varchar(32);
node2 mysql> ALTER TABLE test.sbtest1 ADD COLUMN `q` varchar(32);
Watch myq_status and the error log on node2:
node2 mysql> ALTER TABLE test.sbtest1 DROP COLUMN `c`;
What are the effects on sysbench after adding the columns p and q?
What happens after dropping 'c'? Why?
How do you recover?
When is RSU safe and unsafe to use?
pt-online-schema-change
Be sure all your nodes are set back to TOI:
node* mysql> SET GLOBAL wsrep_osu_method = 'TOI';
Let's add one final column:
node2# pt-online-schema-change --alter \ "ADD COLUMN z varchar(32)" D=test,t=sbtest1 --execute
Was pt-online-schema-change any better at doing DDL?
Was the application blocked at all?
What are the caveats to using pt-osc?
XtraDB Cluster Tutorial: APPLICATION HIGH AVAILABILITY
ProxySQL Architecture
Setting up HAProxy
Install HAProxy - just on node1.
Create /etc/haproxy/haproxy.cfg:
  Copy from /root/haproxy.cfg
Restart haproxy.
Test connections to the proxy ports:
node1# mysql -u test -ptest -h 192.168.70.2 -P 4306 \
  -e "show variables like 'wsrep_node_name'"
node1# mysql -u test -ptest -h 192.168.70.2 -P 5306 \
  -e "show variables like 'wsrep_node_name'"
Test shutting down nodes.
Watch http://127.0.0.1:9991
HAProxy Maintenance Mode
Put a node into maintenance mode:

node1# echo "disable server cluster-reads/node3" | socat \
  /var/lib/haproxy/haproxy.sock stdio

Refresh the HAProxy status page.
Test connections to port 5306.
node3 is still online, but HAProxy won't route connections to it.
Enable the node
node1# echo "enable server cluster-reads/node3" | socat \/var/lib/haproxy/haproxy.sock stdio
Refresh status page and test connections
keepalived Architecture
keepalived for a simple VIP
Install keepalived (all nodes).
Create /etc/keepalived/keepalived.conf:
  Copy from /root/keepalived.conf
systemctl start keepalived
Check which host has the VIP:

ip addr list | grep 192.168.70.100

Experiment with restarting nodes while checking which host answers the VIP:

while true; do
  mysql -u test -ptest -h 192.168.70.100 --connect_timeout=5 \
    -e "show variables like 'wsrep_node_name'"
  sleep 1
done
Application HA Essentials
Nodes are properly health checked.
HA is most commonly handled at layer 4 (TCP).
  Therefore, expect node failures to mean app reconnects.
The cluster will prevent apps from doing anything bad.
  But the cluster may not necessarily force apps to reconnect by itself.
Watch out for Donor nodes.
XtraDB Cluster Tutorial: MONITORING GALERA
At a Low Level

mysql> SHOW GLOBAL VARIABLES LIKE 'wsrep%';
mysql> SHOW GLOBAL STATUS LIKE 'wsrep%';

Refer to:
http://www.codership.com/wiki/doku.php?id=mysql_options_0.8
http://www.codership.com/wiki/doku.php?id=galera_status_0.8
http://www.percona.com/doc/percona-xtradb-cluster/wsrep-status-index.html
http://www.percona.com/doc/percona-xtradb-cluster/wsrep-system-index.html
Example Nagios Checks

node1# yum install percona-nagios-plugins -y

-- Verify we're in a Primary cluster
node1# /usr/lib64/nagios/plugins/pmp-check-mysql-status \
  -x wsrep_cluster_status -C == -T str -c non-Primary
-- Verify the node is Syncednode1# /usr/lib64/nagios/plugins/pmp-check-mysql-status \ -x wsrep_local_state_comment -C '!=' -T str -w Synced
-- Verify the size of the cluster is normalnode1# /usr/lib64/nagios/plugins/pmp-check-mysql-status \ -x wsrep_cluster_size -C '<=' -w 2 -c 1
-- Watch for flow controlnode1# /usr/lib64/nagios/plugins/pmp-check-mysql-status \ -x wsrep_flow_control_paused -w 0.1 -c 0.9
Alerting Best Practices
Alerting should be actionable, not noise.
Any other ideas of types of checks to do?
http://www.percona.com/blog/2013/10/31/percona-xtradb-cluster-galera-with-percona-monitoring-plugins
Percona Monitoring and Management
(*) https://pmmdemo.percona.com/
ClusterControl - SeveralNines
XtraDB Cluster Tutorial: INCREMENTAL STATE TRANSFER
Incremental State Transfer (IST)
Helps to avoid a full SST.
When a node starts, it knows the UUID of the cluster to which it belongs and the last sequence number it applied.
Upon connecting to other cluster nodes, it broadcasts this information.
If another node has the missing write-sets cached, IST is triggered.
If no other node has them, SST is triggered.
Firewall alert: IST uses port 4568 by default.
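The decision above can be sketched as a simple predicate: IST is possible only when the joiner shares the cluster's UUID and some donor's gcache still reaches back to the joiner's next needed seqno. A toy model for intuition, not Galera's actual code:

```python
# Illustrative sketch of the IST-vs-SST decision a joiner faces.
# donor_cached_downto models wsrep_local_cached_downto on the donor.

def transfer_method(joiner_uuid, joiner_seqno, donor_uuid, donor_cached_downto):
    if joiner_uuid != donor_uuid:
        return "SST"  # different cluster history: full state transfer
    if donor_cached_downto <= joiner_seqno + 1:
        return "IST"  # gcache still holds everything the joiner is missing
    return "SST"      # the needed write-sets were already purged

assert transfer_method("abc", 100, "abc", 50) == "IST"
assert transfer_method("abc", 100, "abc", 102) == "SST"  # gap in gcache
assert transfer_method("abc", 100, "xyz", 1) == "SST"    # wrong cluster UUID
```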
Galera Cache - "gcache"
Write-sets are contained in the "gcache" on each node.
It is a preallocated file with a specific size; the default size is 128M.
It represents a single-file circular buffer.
It can be increased via a provider option:

wsrep_provider_options = "gcache.size=2G";

The cache is mmapped (I/O buffered to memory).
wsrep_local_cached_downto - the first seqno present in the cache on that node.
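The circular-buffer behavior is what makes wsrep_local_cached_downto move: once the file is full, the oldest write-sets are overwritten. A toy model (illustrative only; the real gcache is a sized mmapped file, not an entry-count ring):

```python
from collections import deque

# Toy model of the gcache as a fixed-capacity ring buffer: when full,
# the oldest write-set is purged and wsrep_local_cached_downto moves up.

class GCache:
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)  # oldest entries fall off the left

    def append(self, seqno):
        self.buf.append(seqno)

    def cached_downto(self):
        """Models wsrep_local_cached_downto: lowest seqno still held."""
        return self.buf[0] if self.buf else None

gc = GCache(capacity=3)
for seqno in range(1, 6):   # write-sets 1..5 into room for only 3
    gc.append(seqno)
print(gc.cached_downto())   # 3 -- write-sets 1 and 2 were purged
```

A bigger gcache.size therefore widens the window during which a restarted node can still rejoin via IST instead of SST.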
Simple IST Test
Make sure your application is running.
Watch myq_status on all nodes.
On another node:
  watch the mysql error log
  stop mysql
  start mysql

How did the cluster react when the node was stopped and restarted?
Was the messaging from the init script and in the log clear?
Is it easy to tell when a node is a Donor for SST vs IST?
XtraDB Cluster Tutorial: NODE FAILURE SCENARIOS
Node Failure Exercises
Run sysbench on node1 as usual; also watch myq_status.
Experiment 1: Clean node restart

node3# systemctl restart mysql
Node Failure Exercises (cont.)
Experiment 2: Partial network outage
Simulate a network outage by using iptables to drop traffic:

node3# iptables -A INPUT -s node1 -j DROP; \
  iptables -A OUTPUT -d node1 -j DROP

After observation, clear the rules:

node3# iptables -F
Node Failure Exercises (cont.)
Experiment 3: node3 stops responding to the cluster
Another network outage, affecting one node:

-- Pay attention to the command continuation backslashes!
node3# iptables -A INPUT -s node1 -j DROP; \
  iptables -A INPUT -s node2 -j DROP; \
  iptables -A OUTPUT -d node1 -j DROP; \
  iptables -A OUTPUT -d node2 -j DROP
Remove iptables rules (after experiments 2 and 3)
node3# iptables -F
A Non-Primary Cluster
Experiment 1: Clean stop of 2/3 nodes (normal sysbench load)

node2# systemctl stop mysql
node3# systemctl stop mysql
A Non-Primary Cluster (cont.)
Experiment 2: Failure of 50% of the cluster (leave node2 down)

node3# systemctl start mysql

-- Wait a few minutes for IST to finish and for the
-- cluster to become stable. Watch the error logs and
-- SHOW GLOBAL STATUS LIKE 'wsrep%'

node3# iptables -A INPUT -s node1 -j DROP; \
  iptables -A OUTPUT -d node1 -j DROP

Force node1 to become Primary like this:

node1 mysql> SET GLOBAL wsrep_provider_options="pc.bootstrap=true";

Clear iptables when done:

node3# iptables -F
Using garbd as a third node
node2's mysql is still stopped!
Install the arbitrator daemon onto node2:

node2# yum install Percona-XtraDB-Cluster-garbd-3.x86_64
node2# garbd -a gcomm://node1:4567,node3:4567 -g mycluster

Experiment 1: Reproduce the node3 failure:

node3# iptables -A INPUT -s node1 -j DROP; \
  iptables -A INPUT -s node2 -j DROP; \
  iptables -A OUTPUT -d node1 -j DROP; \
  iptables -A OUTPUT -d node2 -j DROP
Using garbd as a third node (cont.)
Experiment 2: Partial node3 network failure

node3# iptables -A INPUT -s node1 -j DROP; \
  iptables -A OUTPUT -d node1 -j DROP

Remember to clear iptables when done:

node3# iptables -F
Recovering from a clean all-down state
Stop mysql on all 3 nodes:

$ systemctl stop mysql

Use grastate.dat to find the node with the highest GTID:

$ cat /var/lib/mysql/grastate.dat

Bootstrap that node, then start the others normally:

$ systemctl start mysql@bootstrap   # on the most advanced node
$ systemctl start mysql             # on each of the other nodes

Why do we need to bootstrap the node with the highest GTID?
Why can't the cluster just "figure it out"?
Recovering from a dirty all-down state
kill -9 mysqld on all 3 nodes:

$ killall -9 mysqld_safe; killall -9 mysqld

Check the grastate.dat again. Is it useful?

$ cat /var/lib/mysql/grastate.dat

Try wsrep_recover on each node to get the GTID:

$ mysqld_safe --wsrep_recover

Bootstrap the highest node and then start the others normally.
What would happen if we started the wrong node?
wsrep_recover pulls GTID data from the InnoDB transaction logs. In what cases might this be inaccurate?
XtraDB Cluster Tutorial: AVOIDING SST
Manual State Transfer with Xtrabackup
Take a backup on node3 (sysbench on node1 as usual):
node3# xtrabackup --backup --galera-info --target-dir=/var/backup
Stop node3 and prepare the backup
node3# systemctl stop mysqlnode3# xtrabackup --prepare --target-dir=/var/backup
Manual State Transfer with Xtrabackup (cont.)
Copy back the backup:

node3# rm -rf /var/lib/mysql/*
node3# xtrabackup --copy-back --target-dir=/var/backup
node3# chown -R mysql.mysql /var/lib/mysql

Get the GTID from the backup and update grastate.dat:

node3# cd /var/lib/mysql
node3# mv xtrabackup_galera_info grastate.dat
node3# $EDIT grastate.dat

# GALERA saved state
version: 2.1
uuid: 8797f811-7f73-11e2-0800-8b513b3819c1
seqno: 22809
safe_to_bootstrap: 0

node3# systemctl start mysql
Cases where the state is zeroed
Add a bad config item to the [mysqld] section of /etc/my.cnf:

[mysqld]
fooey
Check what happens to the grastate:
node3# systemctl stop mysql
node3# cat /var/lib/mysql/grastate.dat
node3# systemctl start mysql
node3# cat /var/lib/mysql/grastate.dat
Remove the bad config and use wsrep_recover to avoid SST
node3# mysqld_safe --wsrep_recover
Update the grastate.dat with that position and restart.
What would have happened if you restarted with a zeroed grastate.dat?
When you WANT an SST

node3# cat /var/lib/mysql/grastate.dat

Turn off replication for the session:

node3 mysql> SET wsrep_on = OFF;

Repeat until node3 crashes (sysbench should be running on node1):

node3 mysql> DELETE FROM test.sbtest1 LIMIT 10000;

..after the crash..

node3# cat /var/lib/mysql/grastate.dat
What is in the error log after the crash?
Should we try to avoid SST in this case?
Performing Checksums
Create a user:

node1 mysql> CREATE USER 'checksum'@'%';
node1 mysql> ALTER USER 'checksum'@'%' IDENTIFIED BY 'checksum1';
node1 mysql> GRANT SUPER, PROCESS, SELECT ON *.* TO 'checksum'@'%';
node1 mysql> GRANT CREATE, SELECT, INSERT, UPDATE, DELETE
          ->   ON percona.* TO 'checksum'@'%';

Create a dummy table:

node1 mysql> CREATE TABLE test.fooey (id INT PRIMARY KEY, name VARCHAR(20));
node1 mysql> INSERT INTO test.fooey VALUES (1, 'Matthew');

Verify the row exists on all nodes.
Performing Checksums (cont.)
Run the checksum:

node1# pt-table-checksum -d test --recursion-method cluster \
  -u checksum -p checksum1

TS             ERRORS DIFFS   ROWS CHUNKS SKIPPED  TIME TABLE
04-23T16:42:20      0     0      1      1       0 0.039 test.fooey
04-23T16:37:38      0     0 250000      5       0 1.931 test.sbtest1

Change data without replicating!

node3 mysql> SET wsrep_on = OFF;
node3 mysql> UPDATE test.fooey SET name = 'Charles';

Re-run the checksum command above on node1.
Who has the error?

ALL mysql> SELECT COUNT(*) FROM percona.checksums WHERE this_crc <> master_crc;
Fix Bad Data!
Use pt-table-sync between a good node and the bad node:

node1# pt-table-sync --execute h=node1,D=test,t=fooey h=node3
XtraDB Cluster Tutorial: TUNING REPLICATION
What is Flow Control?
The ability for any node in the cluster to tell the other nodes to pause writes while it catches up.
It is a feedback mechanism for the replication process.
It is ONLY caused by wsrep_local_recv_queue exceeding a node's fc_limit.
This can pause the entire cluster and look like a cluster stall!
Observing Flow Control
Run sysbench on node1 as usual.
Take a read lock on node3 and observe its effect on the cluster:
node3 mysql> SET GLOBAL pxc_strict_mode = 0;node3 mysql> LOCK TABLE test.sbtest1 WRITE;
Make observations. When you are done, release the lock:
node3 mysql> UNLOCK TABLES;
What happens to sysbench throughput?
What are the symptoms of flow control?
How can you detect it?
Tuning Flow Control
Run sysbench on node1 as usual.
Increase the fc_limit on node3:
node3 mysql> SET GLOBAL wsrep_provider_options="gcs.fc_limit=500";
Take a read lock on node3 and observe its effect on the cluster:
node3 mysql> LOCK TABLE test.sbtest1 WRITE;
Make observations. When you are done, release the lock:
node3 mysql> UNLOCK TABLES;
Now, how big does the recv queue on node3 get before FC kicks in?
Flow Control Parameters
gcs.fc_limit
  Pause replication if the recv queue exceeds this many write-sets.
  Default: 100. For master-slave setups this number can be increased considerably.
gcs.fc_master_slave
  When this is NO, the effective gcs.fc_limit is multiplied by sqrt(number of cluster members).
  Default: NO.
  Ex: gcs.fc_limit = 500 * sqrt(3) = 866
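The scaling rule above is simple enough to verify by hand; a minimal sketch of the arithmetic (the function name is illustrative, not a Galera API):

```python
import math

# Effective flow-control threshold: with gcs.fc_master_slave=NO (the
# default), the configured gcs.fc_limit is scaled by sqrt(cluster size).

def effective_fc_limit(fc_limit, cluster_size, master_slave=False):
    if master_slave:
        return fc_limit                          # used as-is
    return int(fc_limit * math.sqrt(cluster_size))

print(effective_fc_limit(500, 3))        # 866, matching the example above
print(effective_fc_limit(500, 3, True))  # 500 with gcs.fc_master_slave=YES
```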
Avoiding Flow Control during Backups
Most backups use FLUSH TABLES WITH READ LOCK.
By default, this causes flow control almost immediately.
We can permit nodes to fall behind:

node3 mysql> SET GLOBAL wsrep_desync=ON;
node3 mysql> FLUSH TABLES WITH READ LOCK;

Do not write to this node!
(No longer the case with 5.7! Yay!)
More Notes on Flow Control
If you set it too low:
  Too many cries for help from random nodes in the cluster.
If you set it too high:
  Increases in replication conflicts with multi-node writes.
  Increases in apply/commit lag on nodes.
That one node with FC issues:
  Deal with that node: bad hardware? not the same specs?
Testing Node Apply Throughput
Using wsrep_desync and LOCK TABLE, we can stop replication on a node for an arbitrary amount of time (until it runs out of disk space).
We can stop a node for a while, release it, and measure the top speed at which it applies.
This gives us an idea of how fast the node is capable of handling the bulk of replication.
Performance here is greatly improved in 5.7 (*)
(*) https://www.percona.com/blog/2017/04/19/how-we-made-percona-xtradb-cluster-scale/
Multiple Applier Threads
Using the wsrep_desync trick, you can measure the throughput of a slave with any setting of wsrep_slave_threads you wish.
wsrep_cert_deps_distance is the maximum "theoretical" number of parallel threads.
Practically, you should probably have far fewer slave threads (500 is a bit much).
Parallel apply has been a source of node inconsistencies in the past.
Wsrep Sync Wait
Apply is asynchronous, so reads after writes on different nodes may see an older view of the database.
Practically, flow control prevents this from becoming like regular async replication.
Tighter consistency can be enforced by ensuring the replication queue is flushed before a read:

SET <session|global> wsrep_sync_wait = [0-7];
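The 0-7 value is a bitmask; the bits are commonly documented as 1 = READ (SELECT, BEGIN), 2 = UPDATE/DELETE, 4 = INSERT/REPLACE. A small decoder for intuition (illustrative helper, not part of any Galera API):

```python
# Decoding the wsrep_sync_wait bitmask: each set bit forces the named
# statement class to wait until the replication queue is applied first.
BITS = {1: "READ", 2: "UPDATE/DELETE", 4: "INSERT/REPLACE"}

def sync_wait_checks(value):
    """Statement classes that will wait for the recv queue to drain."""
    return [name for bit, name in BITS.items() if value & bit]

print(sync_wait_checks(0))  # [] -- no waiting (the default)
print(sync_wait_checks(1))  # ['READ'] -- causal reads only
print(sync_wait_checks(7))  # all three classes wait
```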
XtraDB Cluster Tutorial: CONCLUSION
Common Configuration Parameters

wsrep_sst_auth = un:pw           Set SST authentication
wsrep_osu_method = 'TOI'         Set DDL method
wsrep_desync = ON                Desync node (exempt it from flow control)
wsrep_slave_threads = 4          Number of applier threads
wsrep_retry_autocommit = 4       Number of commit retries
wsrep_provider_options:
  * gcs.fc_limit = 500           Increase flow control threshold
  * pc.bootstrap = true          Put node into single-node Primary mode
  * gcache.size = 2G             Increase Galera cache file
wsrep_sst_donor = node1          Specify donor node
Common Status Counters

wsrep_cluster_status         Primary / Non-Primary
wsrep_local_state_comment    Synced / Donor / Desynced
wsrep_cluster_size           Number of online nodes
wsrep_flow_control_paused    Fraction of time the cluster has been paused by flow control
wsrep_local_recv_queue       Current number of write-sets waiting to be applied
Questions?
(*) Slide content authors: Jay Janssen, Fred Descamps, Matthew Boehm
102 / 102
PXC - Hands-On Tutorial
ist/sst donor selection
● PXC has the concept of a segment (gmcast.segment)
Check if you can find gmcast.segment (by querying SELECT @@wsrep_provider_options)
● Assign the same segment id to cluster nodes that are co-located in the same subnet/LAN.
● Assign a different segment id to cluster nodes that are located in a different subnet/WAN.
SELECT SUBSTRING(@@wsrep_provider_options, LOCATE('gmcast.segment', @@wsrep_provider_options), 18);
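As an aside, the same extraction is easier to read as a regex (illustrative Python; the options string here is a shortened, made-up sample, not real server output):

```python
import re

# A shortened, hypothetical wsrep_provider_options string for illustration.
options = "evs.send_window = 4; gcache.size = 128M; gmcast.segment = 1; gcs.fc_limit = 100"

def segment_of(opts):
    """Pull the gmcast.segment value out of a provider-options string."""
    m = re.search(r"gmcast\.segment\s*=\s*(\d+)", opts)
    return int(m.group(1)) if m else None

print(segment_of(options))  # 1
```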
concept of segment
exercise
(diagram: subnet-1 nodes with gmcast.segment=0; subnet-2 nodes with gmcast.segment=1)
● SST DONOR Selection
○ Search by name (if specified by the user via --wsrep_sst_donor).
■ If the node's state restricts it from acting as DONOR, things get delayed.
Member 0.0 (n3) requested state transfer from 'n2', but it is impossible to select State Transfer donor: Resource temporarily unavailable
○ Search donor by state:
■ Scan all nodes.
■ Check if a node can act as DONOR (avoid DESYNCed nodes, arbitrator).
● YES: Is the node part of the same segment as the JOINER? -> Yes -> Got DONOR
● NO: Keep searching, but cache this remote-segment node too. Worst case, it will be used.
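The search-by-state loop above can be sketched roughly as follows (an illustrative sketch, not Galera's actual code; the node names and fields are invented for the example):

```python
def pick_donor(joiner_segment, nodes):
    """Sketch of donor search-by-state: prefer a usable node in the
    joiner's segment; cache the first usable remote-segment node as
    a worst-case fallback."""
    fallback = None
    for n in nodes:
        # Skip nodes whose state restricts them from donating
        if n["desynced"] or n["arbitrator"]:
            continue
        if n["segment"] == joiner_segment:
            return n["name"]          # same-segment donor found
        if fallback is None:
            fallback = n["name"]      # remember a remote-segment donor
    # May be None: "Resource temporarily unavailable"
    return fallback

nodes = [
    {"name": "garbd", "segment": 0, "desynced": False, "arbitrator": True},
    {"name": "n1",    "segment": 0, "desynced": True,  "arbitrator": False},
    {"name": "n2",    "segment": 1, "desynced": False, "arbitrator": False},
]
print(pick_donor(1, nodes))  # n2 (same segment as the joiner)
print(pick_donor(0, nodes))  # n2 (remote-segment fallback; garbd and n1 are skipped)
```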
sst donor selection
● cluster sees workload
ist donor selection
(diagram: n1, n2, n3; wset-1 and wset-2 entering the cluster)
● workload is processed and replicated on all the nodes
ist donor selection
(diagram: wset-1 and wset-2 replicated to n1, n2, n3)
● n3 needs to leave cluster for sometime
ist donor selection
(diagram: n3 out of the cluster; n1 and n2 hold wset-1, wset-2)
● cluster makes further progress
ist donor selection
(diagram: n1 and n2 add wset-3, wset-4; n3 still only has wset-1, wset-2)
● n1/n2 needs to purge gcache.
ist donor selection
(diagram: n1 and n2 purge parts of their gcache; the retained write-sets now differ between them; n3 still at wset-1, wset-2)
● n3 need to re-join the cluster.
● Which node should it select and Why ?
ist donor selection
(diagram: n3 re-joining; n1 and n2 hold different ranges of write-sets in their gcache)
● n3 will select n2, given that n2 has retained enough write-sets and the probability of it purging further (with respect to n1) is low.
● wsrep_local_cached_downto is the variable to look for. It indicates the lowest sequence number held in the gcache.
● Concept of Safety-Gap.
ist donor selection
● Test?
○ 4 nodes: n1, n2, n3, n4.
○ n1 and n2 are in one segment.
○ n3 and n4 are in another segment.
○ n3 needs to leave the cluster. n3 has trx up to write-set (100).
○ The cluster makes progress and purges too.
○ Latest available write-sets: n1 (200), n2 (250), n4 (200).
○ n3 wants to re-join. Which DONOR will be selected, and why?
ist donor selection
galera gtid
● MySQL GTID vs Galera GTID
○ MySQL GTID is based on server_uuid. This means it is different on all the nodes.
○ Galera GTID is used explicitly for replication and is the same on all the nodes that are part of the cluster.

select @@server_uuid;
select @@global.gtid_executed;
show status like 'wsrep_cluster_state_uuid';

● Global Sequence Number:
○ The incoming queue is like a FIFO.
○ Each write-set is delivered to all the nodes of the cluster in the same order.
● A Galera GTID is assigned only for successful actions.
6a088373-da4b-ee18-6bf3-f0b4e6e19c7d:1-3
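A Galera GTID like the one above is just the cluster state UUID plus a range of global seqnos; a tiny illustrative parser:

```python
def parse_galera_gtid(gtid):
    """Split 'uuid:first-last' (or 'uuid:seqno') into its parts."""
    uuid, _, seqnos = gtid.partition(":")
    first, _, last = seqnos.partition("-")
    return uuid, int(first), int(last or first)

uuid, first, last = parse_galera_gtid("6a088373-da4b-ee18-6bf3-f0b4e6e19c7d:1-3")
print(uuid)         # the cluster state UUID, same on every node
print(first, last)  # 1 3
```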
gtid and global-seqno
exercise
understanding timeout
understanding timeout
N1 N3N2
Group Channel
State-1: 3-node cluster, all healthy. When there is no workload, a keep-alive signal is sent every 1 sec.
N1 N3N2
Group Channel
State-2: N1 has a flaky network connection and can't respond within the configured inactive_check_period. The cluster (N2, N3) marks N1 for the delayed list and adds it once it crosses the delayed margin.
N1 N3N2
Group Channel
State-3: N1's network connection is restored, but to get it out of the delayed list, N2 and N3 need assurance from N1 for delayed_keep_period.
N1 N3N2
Group Channel
State-4: N1 again has a flaky connection and doesn't respond for inactive_timeout (the node is already in the delayed list). This will cause the node to be pronounced DEAD.
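A rough sketch of these decisions (illustrative only; the threshold values here are made up, and the real EVS logic is more involved):

```python
# Simplified sketch of the delayed-list / dead-node decisions from
# States 1-4 above. Times are in seconds; the real logic lives in
# the Galera EVS layer and is driven by the evs.* timeouts.
INACTIVE_CHECK_PERIOD = 0.5   # how often liveness is checked (assumed value)
DELAYED_MARGIN = 1.0          # unresponsive longer than this -> delayed list (assumed)
INACTIVE_TIMEOUT = 15.0       # unresponsive longer than this -> DEAD (assumed)

def classify(seconds_since_last_reply, in_delayed_list):
    if seconds_since_last_reply >= INACTIVE_TIMEOUT:
        return "DEAD"                      # State-4
    if seconds_since_last_reply > DELAYED_MARGIN:
        return "DELAYED"                   # State-2
    # State-3: a delayed node must stay responsive for delayed_keep_period
    # before it is removed from the delayed list (not modeled here).
    return "DELAYED (probation)" if in_delayed_list else "OK"

print(classify(0.3, False))   # OK (State-1: healthy keep-alives)
print(classify(3.0, False))   # DELAYED
print(classify(20.0, True))   # DEAD
```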
geo-distributed cluster
● Nodes of the cluster are distributed across data-centers, yet all nodes are part of one big cluster.
geo-distributed cluster
(diagram: N1, N2, N3 in one data-center; N4, N5, N6 in another, connected over WAN)
● Important aspect to consider:
○ gmcast.segment, which helps describe the topology. All local nodes should have the same gmcast.segment. This
helps in optimizing replication traffic and, of course, SST and IST.
○ Tune timeouts if needed for your WAN setup.
○ Avoid flow control and fragmentation at the Galera level (example settings:
gcs.max_packet_size=1048576; evs.send_window=512; evs.user_send_window=256).
pxc-5.7
● PXC-5.7 GAed during PLAM-2016 (Sep-2016)
○ Since then we have done 3 more releases
Percona XtraDB Cluster 5.7
Sep-16: PXC-5.7.14-26.17
Dec-16: PXC-5.7.16-27.19
Mar-17: PXC-5.7.17-27.20
Apr-17: PXC-5.7.17-29.20
● Cluster-Safe Mode (pxc-strict-mode)
● Extended support for PFS
● Support for encrypted tablespaces
● Improved security
● ProxySQL-compatible PXC
● Tracking IST/Flow-control
● Improved logging and wsrep-stage framework
● Fully Scalable Performance Optimized PXC
● PXC + PMM
● Simplified packaging
Percona XtraDB Cluster 5.7
cluster-safe-mode
to block experimental features
cluster-safe-mode
Why pxc-strict-mode ?
binlog_format=row
Nontransactional SE
table-wo-primary-key
LOCAL LOCKS
LOCAL Ops (ALTER IMPORT/DISCARD)
CTAS
● pxc_strict_mode
● The mode is local to the given node. All nodes should ideally operate in the same mode.
● Not applicable to replicated traffic. Applicable only to user traffic.
● When moving to a stricter mode, make sure the validation checks pass.
cluster-safe-mode
ENFORCING = ERROR
MASTER = ERROR (except locks)
PERMISSIVE = WARNING
DISABLED = 5.6-compatible
● Try to insert into a non-pk table.
mysql> use test;
Database changed

mysql> create table t (i int) engine=innodb;
Query OK, 0 rows affected (0.04 sec)

mysql> insert into t values (1);
ERROR 1105 (HY000): Percona-XtraDB-Cluster prohibits use of DML command on a table (test.t) without an explicit primary key with pxc_strict_mode = ENFORCING or MASTER

mysql> alter table t add primary key pk(i);
Query OK, 0 rows affected (0.06 sec)
Records: 0 Duplicates: 0 Warnings: 0

mysql> insert into t values (1);
Query OK, 1 row affected (0.01 sec)
cluster-safe-mode
exercise
● Try to alter a table between InnoDB and MyISAM
mysql> alter table t engine=myisam;
ERROR 1105 (HY000): Percona-XtraDB-Cluster prohibits changing storage engine of a table (test.t) from transactional to non-transactional with pxc_strict_mode = ENFORCING or MASTER
mysql> drop table t;

mysql> create table myisam (i int) engine=myisam;
Query OK, 0 rows affected (0.01 sec)

mysql> alter table myisam engine=innodb;
Query OK, 0 rows affected (0.04 sec)
Records: 0 Duplicates: 0 Warnings: 0

mysql> show create table myisam;
+--------+-------------------------------------------------------------------------------------------+
| Table  | Create Table                                                                              |
+--------+-------------------------------------------------------------------------------------------+
| myisam | CREATE TABLE `myisam` ( `i` int(11) DEFAULT NULL) ENGINE=InnoDB DEFAULT CHARSET=latin1    |
+--------+-------------------------------------------------------------------------------------------+
mysql> drop table myisam;
cluster-safe-mode
exercise
● Switching to DISABLED and back to ENFORCING
mysql> set global pxc_strict_mode='DISABLED';
Query OK, 0 rows affected (0.00 sec)

mysql> create table myisam (i int) engine=myisam;
Query OK, 0 rows affected (0.01 sec)

mysql> set wsrep_replicate_myisam=1;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into myisam values (10);
Query OK, 1 row affected (0.00 sec)
Check if you see entry (10) on node-2.
What will happen if we now switch
BACK from DISABLED TO ENFORCING ?
cluster-safe-mode
exercise
● Back to ENFORCING
mysql> set global pxc_strict_mode='ENFORCING';
ERROR 1105 (HY000): Can't change pxc_strict_mode to ENFORCING as wsrep_replicate_myisam is ON

mysql> set wsrep_replicate_myisam=0;
Query OK, 0 rows affected (0.00 sec)

mysql> set global pxc_strict_mode='ENFORCING';
Query OK, 0 rows affected (0.00 sec)

node1 mysql> drop table myisam;
Query OK, 0 rows affected (0.00 sec)
Switching will check for the following preconditions:
● wsrep_replicate_myisam = 0
● binlog_format = ROW
● log_output = NONE/FILE
● isolation level != SERIALIZABLE
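The server performs this check internally; a rough sketch of the same logic (illustrative, with invented error strings except the first one, which matches the error shown earlier):

```python
def can_enforce(settings):
    """Return an error string if switching to ENFORCING must fail,
    mirroring the precondition list above; None means it succeeds."""
    if settings.get("wsrep_replicate_myisam"):
        return "Can't change pxc_strict_mode to ENFORCING as wsrep_replicate_myisam is ON"
    if settings.get("binlog_format") != "ROW":
        return "binlog_format must be ROW"
    if settings.get("log_output") not in ("NONE", "FILE"):
        return "log_output must be NONE or FILE"
    if settings.get("tx_isolation") == "SERIALIZABLE":
        return "isolation level must not be SERIALIZABLE"
    return None  # all preconditions pass

cfg = {"wsrep_replicate_myisam": True, "binlog_format": "ROW",
       "log_output": "FILE", "tx_isolation": "REPEATABLE-READ"}
print(can_enforce(cfg))          # the wsrep_replicate_myisam error
cfg["wsrep_replicate_myisam"] = False
print(can_enforce(cfg))          # None -> the switch succeeds
```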
cluster-safe-mode
exercise
● Replicated traffic is not scrutinized (nor are TOI statements or MyISAM DML).
node-1:
mysql> set global pxc_strict_mode='DISABLED';
Query OK, 0 rows affected (0.00 sec)
mysql> create table t (i int) engine=innodb;
Query OK, 0 rows affected (0.06 sec)

mysql> insert into t values (100);
Query OK, 1 row affected (0.01 sec)

mysql> set global pxc_strict_mode='ENFORCING';
Query OK, 0 rows affected (0.00 sec)

node-2:
mysql> select * from test.t;
+------+
| i    |
+------+
|  100 |
+------+
1 row in set (0.00 sec)
cluster-safe-mode
exercise
Remove all tables we created in this experiment (t, myisam). Check that the mode is reset back to ENFORCING.
pxc+pfs
● PFS has the concept of instruments. DB internal objects (mutexes, condition variables, files, threads, etc.) are called instruments.
● Users can decide which instruments to monitor based on need.
● PXC objects to monitor:
○ PXC threads (doesn’t generate event directly)
○ PXC files (doesn’t generate event directly)
○ PXC stages
○ PXC mutexes
○ PXC condition variables
performance-schema
● Applier/Slave Thread(s)
● Rollback/Aborter Thread
● IST Receiver/Sender Thread
● SST Joiner/Donor Thread
● Gcache File Remove Thread
● Receiver Thread
● Service Thread
● GComm Thread
● Write-set Checksum Thread
performance-schema: threads
● Ring Buffer File (GCache)
● GCache Overflow Page File
● RecordSet File
● grastate file/gvwstate file
performance-schema: files
● Any PXC action passes through different stages:
○ Writing/Deleting/Updating Rows
○ Applying write-sets
○ Commit/Rollback
○ Replicating Commit
○ Replaying
○ Preparing for TO Isolation
○ Applier Idle
○ Rollback/Aborted Idle
○ Aborter Active
performance-schema: stages
● Mutex/Cond-Variable/Monitor.
● Mutexes/Cond-Variables needed for coordination at each stage:
○ wsrep_ready, wsrep_sst_init, wsrep_sst, wsrep_rollback,
wsrep_slave_threads, wsrep_desync, wsrep_sst_thread,
wsrep_replaying.
○ certification mutex, stats collection mutex, monitor mutexes (that
control parallel replication), recvbuf mutex, transaction handling mutex,
saved state mutexes, etc.. and corresponding condition-variables.
● A Monitor is like a mutex that enforces ordering. Only a single thread is allowed to
reside in the monitor, and which thread goes next depends on global_seqno.
performance-schema: mutex/cvar
● Mainly to monitor the cluster
○ For performance monitoring and tuning
○ What is taking time ? Certification ? Replication ? Commit ?
○ What is current state of the processing thread (processing + commit wait)
○ Track transaction rollback incident ?
○ …. etc.
○ More to come from performance analysis perspective in future.
why use pfs in pxc ?
● Let’s track some wsrep-stages
● Enable the instruments to track on both nodes of the cluster (node-1, node-2):
use performance_schema;
update setup_instruments set enabled='YES', timed='YES' where name like 'stage%wsrep%';
update setup_consumers set enabled='YES' where name like 'events_stages_%';
● Start sysbench workload
● Let’s start tracking some state and time it takes.mysql> select thread_id, event_name, timer_wait/100000000 as msec from events_stages_history where event_name like '%in replicate
stage%' order by msec desc limit 10;
+-----------+---------------------------------------+--------+
| thread_id | event_name | msec |
+-----------+---------------------------------------+--------+
| 42 | stage/wsrep/wsrep: in replicate stage | 0.0506 |
● Let’s now see the Slave View
performance schema
Remember to disable PFS back to off once done.
exercise
● Let’s track conflict
● Enable the instruments to track on both nodes of the cluster (node-1, node-2):
use performance_schema;
update setup_instruments set enabled='YES', timed='YES' where name like 'wait%wsrep%';
update setup_consumers set enabled='YES' where name like '%wait%';
● Run a conflicting workload (bf_abort)
● Track the events generated.
mysql> select thread_id, event_id, event_name, timer_wait/1000000000000 from events_waits_history_long where event_name like '%COND%rollback%';
+-----------+----------+-----------------------------------------+--------------------------+
| thread_id | event_id | event_name | timer_wait/1000000000000 |
+-----------+----------+-----------------------------------------+--------------------------+
| 7 | 12 | wait/synch/cond/sql/COND_wsrep_rollback | 191.6299 |
| 7 | 18 | wait/synch/cond/sql/COND_wsrep_rollback | 117.6499 |
+-----------+----------+-----------------------------------------+--------------------------+
2 rows in set (0.00 sec)
performance schema
Remember to disable PFS back to off once done.
exercise
pxc + secure
● MySQL-5.7 introduced an encrypted tablespace.
● If your existing cluster node (DONOR) has an encrypted tablespace, then you should be
able to retain the same tablespace in encrypted form after donating it to
the JOINER.
● PXC takes care of moving the keyring (needed for decrypting the tablespace) and then
re-encrypting with the new keyring. (All this is done with the help of XB (Xtrabackup).) All this
is done transparently.
● Since the keyring is transmitted between the 2 nodes, it is important to ensure that the SST
connection is encrypted, and both nodes should have the keyring plugin configured.
support for encrypted tablespace
● PXC internode traffic needs to be secured. There are 2 main types of traffic:
○ SST, during the initial node setup, with the involvement of a 3rd-party tool (XB).
○ IST/replication traffic, internal between PXC nodes.
● PXC-5.7 supports a mode, encrypt=4, that accepts the familiar triplet of ssl-key/ca/cert,
just like MySQL client-server communication. Remember to enable it on all nodes.
[sst]
encrypt=4
ssl-ca=<PATH>/ca.pem
ssl-cert=<PATH>/server-cert.pem
ssl-key=<PATH>/server-key.pem
● MySQL auto-generates these certificates during bootstrap; you can continue to use them, but ensure
all nodes use the same files, or the same ca-file for generating each server-cert.
improved security
● IST/replication traffic already supports the triplet, and here is the way to specify it:
wsrep_provider_options="socket.ssl=yes;socket.ssl_key=/path/to/server-key.pem;socket.ssl_cert=/path/to/server-c
ert.pem;socket.ssl_ca=/path/to/cacert.pem"
improved security
WAIT!!!! I am a first-time user. This sounds a bit tedious.
● pxc-encrypt-cluster-traffic helps secure SST/IST/replication traffic by passing the
default key/ca/cert generated by MySQL.
● Just set this variable to ON (pxc-encrypt-cluster-traffic=ON). It will look for files in the
mysqld section (the ones mysql uses for its own SSL connections) or in the data directory for
auto-generated files. It then auto-passes them for SST/IST secure connections.
OK . that’s simple NOW !!!!
https://www.percona.com/doc/percona-xtradb-cluster/LATEST/howtos/encrypt-traffic.html
improved security
pxc + proxysql
● ProxySQL is one of the leading load balancers.
● PXC is fully integrated with ProxySQL.
○ ProxySQL can monitor PXC nodes
○ Percona has developed a simple single-step solution to auto-configure the proxysql
tracking tables by auto-discovering pxc nodes (proxysql-admin).
○ The same script can also be used to configure different setups:
■ LoadBalancer: Each node is marked for read/write
■ SingleWriter: Only one node services WRITEs. The other nodes can be used for
READs.
pxc + proxysql
● Often it is important to take a node down for some maintenance.
○ Shutdown the server
○ No load on server
● Naturally a user would like to perform these operations directly on the node, but it could be
risky with a LoadBalancer involved, as it may still be sending the node active queries.
● To make this easier, we now allow the user to set this using a single-step process
● pxc_maint_mode (default=DISABLED)
● pxc_maint_transition_period (default=10)
proxysql assisted pxc mode
pxc_maint_mode
proxysql assisted pxc mode
MAINTENANCE
On completion of maintenance user can switch it back to DISABLED.
SHUTDOWN
DISABLED
On shutdown of the node, the node waits for pxc_maint_transition_period before delivering the shutdown signal.
The user switches the mode to MAINTENANCE. The node waits for pxc_maint_transition_period before returning the prompt, but marks the status updated immediately, allowing ProxySQL to update the status.
ProxySQL
Monitors for changes in status and accordingly updates the internal table to avoid sending traffic to the node.
proxysql assisted pxc mode
● Install ProxySQL.
cd /var/repo; yum install proxysql-1.3.5-1.1.x86_64.rpm
which proxysql
proxysql --version
which proxysql-admin
proxysql-admin -v
cat /etc/proxysql-admin.cnf
node1 mysql> CREATE USER 'admin'@'%' IDENTIFIED by 'admin';
node1 mysql> GRANT all privileges on *.* to 'admin'@'%';
node1 mysql> FLUSH privileges;
exercise
proxysql assisted pxc mode
● Ensure proxysql is running. If not, start proxysql in the foreground using "proxysql -f".
mysql -uadmin -padmin -P 6032 -h 127.0.0.1 (.... this will open proxysql terminal)
proxysql-admin -e --write-node="192.168.70.2:3306" (.... the proxysql user doesn't have privileges)

node1 mysql> GRANT all privileges on *.* to 'proxysql_user'@'192.%';
node1 mysql> FLUSH privileges;
● Now check and ensure the said information is in proxysql:
mysql -h 127.0.0.1 -P 6032 -uadmin -padmin (connecting to proxysql)
select * from mysql_users; .... mysql_servers; schedulers;
● Check if proxysql is able to route connections to the server:
mysql --user=proxysql_user --password=passw0rd --host=localhost --port=6033 --protocol=tcp -e "select @@hostname"
exercise
● Check the routing rules through the mysql_servers table of proxysql.
node1 mysql> select * from mysql_servers;
+--------------+--------------+------+--------+---------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname | port | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+--------------+------+--------+---------+-------------+-----------------+---------------------+---------+----------------+---------+
| 10 | 192.168.70.2 | 3306 | ONLINE | 1000000 | 0 | 1000 | 0 | 0 | 0 | WRITE |
| 11 | 192.168.70.3 | 3306 | ONLINE | 1000 | 0 | 1000 | 0 | 0 | 0 | READ |
| 11 | 192.168.70.4 | 3306 | ONLINE | 1000 | 0 | 1000 | 0 | 0 | 0 | READ |
+--------------+--------------+------+--------+---------+-------------+-----------------+---------------------+---------+----------------+---------+
3 rows in set (0.00 sec)
● Start monitoring to check if proxysql is multiplexing the queries among the 2 READ hosts.
watch -n 0.1 "mysql -u proxysql_user -P 6033 -ppassw0rd -h 192.168.70.3 --protocol=tcp -e \"select @@wsrep_node_name\""
proxysql assisted pxc mode
exercise
● Let’s shutdown node-2 and check multiplexing.
Monitor statuswatch -n 0.1 "mysql -uadmin -padmin -P 6032 -h 127.0.0.1 -e \"select status from mysql_servers\""
Monitor pxc_maint_modewatch -n 0.1 "mysql -e \"select @@pxc_maint_mode\""
Shutdown and then restart node-2systemctl stop mysql (check if multiplexing now stabilizes to node3)
systemctl start mysql
proxysql assisted pxc mode
exercise
track more using show status
● Often a node is taken down (server shutdown) for some maintenance, and restarting it can cause
IST. The size of the IST depends on the write-sets processed. It can take noticeable time for the IST activity to
complete (while IST is in progress, the node is not usable).
● Now we can track IST progress through 'wsrep_ist_receive_status':
| wsrep_ist_receive_status | 75% complete, received seqno 1980 of 1610-2103 |
watch -n 0.1 "mysql -e \"show status like 'wsrep_ist_receive_status'\""
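The reported percentage is simply the position within the advertised seqno range; an illustrative parser for the status line shown above:

```python
import re

def ist_progress(status):
    """Parse a wsrep_ist_receive_status line and recompute the percentage
    from the received seqno and the advertised range."""
    m = re.search(r"received seqno (\d+) of (\d+)-(\d+)", status)
    received, first, last = map(int, m.groups())
    return round(100 * (received - first) / (last - first))

print(ist_progress("75% complete, received seqno 1980 of 1610-2103"))  # 75
```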
Let’s see this happening. We will shutdown node-2
do some quick 30 sec OLTP workload on node-1
and then restart node-2 and track IST.
Check what is the wsrep_last_committed post IST.
tracking ist
exercise
● Flow control is a self-regulating mechanism to avoid a case wherein the weakest/slowest node of the cluster
falls behind the main cluster by a significant gap.
● Each replicating node has an incoming/receiving queue.
○ Size of this queue is configurable.
○ Default for now is 100. (Was 16 initially)
○ If there are multiple nodes in the cluster, this size is adjusted based on the number of nodes:
sqrt(number-of-nodes) * configured size.
○ Size is configurable through the gcs.fc_limit variable.
○ Setting it on one node is not replicated to the other nodes.
tracking flow control
● Starting with the new release, this size is exposed through the status variable 'wsrep_flow_control_interval'.
mysql> show status like 'wsrep_flow_control_interval';
+-----------------------------+--------------+
| Variable_name | Value |
+-----------------------------+--------------+
| wsrep_flow_control_interval | [ 141, 141 ] |
+-----------------------------+--------------+
1 row in set (0.01 sec)
● Concept of lower and higher water mark. (gcs.fc_factor)
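A quick sketch of how the interval scales (illustrative; it assumes the sqrt formula above and that gcs.fc_factor sets the lower water mark as a fraction of the upper one, with an assumed default of 1.0):

```python
import math

def fc_interval(fc_limit, n_nodes, fc_factor=1.0):
    """Approximate the [lower, upper] flow-control water marks
    from gcs.fc_limit, the node count, and gcs.fc_factor."""
    upper = int(round(fc_limit * math.sqrt(n_nodes)))
    lower = int(round(upper * fc_factor))
    return [lower, upper]

print(fc_interval(100, 2))   # [141, 141] -- matches the status output above
print(fc_interval(100, 3))   # [173, 173]
```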
Exercise time: We will change the flow-control limit using gcs.fc_limit and see the effect.
tracking flow control
exercise
● Often a node enters flow control depending on the size of the queue and the load on the server.
● It is important to be able to track if the node is in flow control. To date, this was difficult to find out.
● Now we have an easy way out: wsrep_flow_control_status
mysql> show status like 'wsrep_flow_control_status';
+---------------------------+-------+
| Variable_name | Value |
+---------------------------+-------+
| wsrep_flow_control_status | ON |
+---------------------------+-------+
● The variable is boolean and will be turned ON and OFF depending on whether the node is in flow control.
We will force node to pause and see
if the FLOW_CONTROL KICKS IN.
* sysbench based flow-control
tracking flow control
exercise
● Based on the reading so far, it may sound interesting to increase the flow-control limit, since the node will
eventually catch up. So why not increase it to a larger value (say 10K, 25K)?
● It has 3 main drawbacks
(Say you set fc_limit to 10K and you have 8K write-sets replicated on the slave waiting to get applied)
○ If the slave fails, it loses those 8K write-sets, and on restart it will demand those 8K+ write-sets through
IST (even if you immediately restart the node within a few seconds).
○ If the DONOR gcache doesn't have these write-sets, since the gap is large, it will invite an SST, which could
be time-consuming.
○ If you care about fresh data and have wsrep_sync_wait=7, a fired SELECT query will be placed at the 8K+1
slot and will wait for those 8K write-sets to get serviced, increasing SELECT (DML) latency.
effect of increasing flow control
improved logging and stages
● We have improved sst/ist logging. We would love to hear back if you have more comments on this front.
● Major changes:
○ Made logging structure easy to grasp.
○ FATAL errors are now highlighted to spot them easily.
○ In case of a FATAL error, the XB logs are appended to the main node logs.
○ Lots of sanity checks inserted to help catch errors early.
○ wsrep-log-debug=1 ([sst] section) to suppress those informational messages for a normal run.
● Some major causes of errors:
○ Wrong config params (usually credential or port or host)
○ Stale processes from the last run or some other failed attempt left behind.
improved sst/ist logging
2017-04-14T10:08:30.336898Z 1 [Note] WSREP: GCache history reset: old(00000000-0000-0000-0000-000000000000:0) ->
new(461a9eac-20fa-11e7-a31c-4f98008f8d11:0)
20170414 15:38:30.533 WSREP_SST: [INFO] Proceeding with SST.........
20170414 15:38:30.567 WSREP_SST: [INFO] ............Waiting for SST streaming to complete!
2017-04-14T10:08:32.451749Z 0 [Note] WSREP: (4b82fdcc, 'tcp://192.168.1.4:5030') turning message relay requesting off
2017-04-14T10:08:42.283267Z 0 [Note] WSREP: 0.0 (n1): State transfer to 1.0 (n2) complete.
2017-04-14T10:08:42.283423Z 0 [Note] WSREP: Member 0.0 (n1) synced with group.
20170414 15:38:43.075 WSREP_SST: [INFO] Preparing the backup at /opt/projects/codebase/pxc/installed/pxc57/pxc-node/dn2//.sst
20170414 15:38:46.879 WSREP_SST: [INFO] Moving the backup to /opt/projects/codebase/pxc/installed/pxc57/pxc-node/dn2/
20170414 15:38:46.901 WSREP_SST: [INFO] Galera co-ords from recovery: 461a9eac-20fa-11e7-a31c-4f98008f8d11:0
2017-04-14T10:08:46.913474Z 0 [Note] WSREP: SST complete, seqno: 0
improved sst/ist logging
2017-04-14T10:11:09.452740Z 2 [Warning] WSREP: Gap in state sequence. Need state transfer.
2017-04-14T10:11:09.452837Z 0 [Note] WSREP: Initiating SST/IST transfer on JOINER side (wsrep_sst_xtrabackup-v2 --role 'joiner'
--address '192.168.1.4:5000' --datadir '/opt/projects/codebase/pxc/installed/pxc57/pxc-node/dn2/' --defaults-file
'/opt/projects/codebase/pxc/installed/pxc57/pxc-node/n2.cnf' --defaults-group-suffix '' --parent '3433' --binlog 'mysql-bin' )
20170414 15:41:09.523 WSREP_SST: [ERROR] ******************* FATAL ERROR **********************
20170414 15:41:09.525 WSREP_SST: [ERROR] The xtrabackup version is 2.3.5. Needs xtrabackup-2.4.4 or higher to perform SST
20170414 15:41:09.526 WSREP_SST: [ERROR] ******************************************************
2017-04-14T10:11:09.527081Z 0 [ERROR] WSREP: Failed to read 'ready <addr>' from: wsrep_sst_xtrabackup-v2 --role 'joiner' --address
'192.168.1.4:5000' --datadir '/opt/projects/codebase/pxc/installed/pxc57/pxc-node/dn2/' --defaults-file
'/opt/projects/codebase/pxc/installed/pxc57/pxc-node/n2.cnf' --defaults-group-suffix '' --parent '3433' --binlog 'mysql-bin'
Read: '(null)'
2017-04-14T10:11:09.527123Z 0 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address
'192.168.1.4:5000' --datadir '/opt/projects/codebase/pxc/installed/pxc57/pxc-node/dn2/' --defaults-file
'/opt/projects/codebase/pxc/installed/pxc57/pxc-node/n2.cnf' --defaults-group-suffix '' --parent '3433' --binlog 'mysql-bin' : 2 (No
such file or directory)
2017-04-14T10:11:09.527169Z 2 [ERROR] WSREP: Failed to prepare for 'xtrabackup-v2' SST. Unrecoverable.
2017-04-14T10:11:09.527189Z 2 [ERROR] Aborting
improved sst/ist logging
● wsrep-stages are important to track what the given thread is working on at any given time.
● MASTER node:
| 7408 | root | 192.168.1.4:49938 | test | Query | 0 | wsrep: pre-commit/certification passed (15406) | COMMIT | 0 | 0 |
● Replicating/Slave node:
| 1569 | system user | | NULL | Sleep | 0 | wsrep: committed write set (14001) | NULL | 0 | 0 |
Let US RUN some quick workload (Say UPDATE_INDEX)
and track some of the IMPORTANT stages that you
may see OFTEN.
improved wsrep stages
exercise
pxc performance
(benchmark graph slides)
● MAJOR MILESTONE IN PXC HISTORY.
● Performance improved for all kinds of workloads and configurations
○ sync_binlog=0/1
○ log-bin=enabled/disabled
○ OLTP/UPDATE_NON_INDEX/UPDATE_INDEX.
○ Scale even with 1024 threads and 3 nodes.
pxc performance
● So what were the problems, and how did we solve them?
● We found 3 major issues that blocked effective use of:
○ REDO log flush parallelism
○ the Group-Commit protocol (LEADER-FOLLOWER batch operation protocol)
pxc performance
● How commit operates:
○ COMMIT (2 steps)
■ prepare (mark trx as prepared, redo flush (if log-bin=0)). If operating in cluster mode (PXC), replicate the
transaction + enforce ordering.
■ commit (flush binlogs + redo flush + commit innodb)
● PROBLEM-1: The CommitMonitor was exercised such that the REDO flush got serialized.
○ Split replicate + enforce ordering
● PROBLEM-2: The Group-Commit logic is optimized for batch load, but with Commit Ordering in PXC, only 1 transaction
was forwarded to the GC logic.
○ Release Commit Ordering early and rely on the GC logic to enforce ordering. (interim_commit)
● PROBLEM-3: The second fsync/redo-flush (with log-bin=0) was again serialized.
○ Release the CommitMonitor, as order is already enforced + the trx is committed in memory. Only the redo log flush is pending.
pxc performance
pxc + pmm
● PXC is fully integrated with PMM.
● PMM helps monitor different aspects of PXC. Let's see some of the key aspects:
○ How long can a node sustain without need for SST (projected based on historical data).
○ Size of Replicated write-set.
○ Conflict
○ Basic statistical information (as to how many nodes are up and running). If a node goes down,
that is reflected.
○ Is the node in a DESYNCed state or in flow control?
Exercise: pmmdemo.percona.com
monitor pxc through pmm
exercise
misc
● Starting with PXC-5.7, we have simplified the packaging.
○ PXC-5.7 is a single package that you install for all PXC components (mainly the pxc server
+ replication library). garbd remains a separate package.
○ External tools like xtrabackup and proxysql are available from the percona repo.
○ PXC is widely supported on most Ubuntu + Debian + CentOS platforms.
simplified packaging
● PXC is MySQL/PS-5.7.17 compatible, and we normally keep it refreshed with major releases
of the MySQL version (typically once per quarter).
● PXC is compatible with Codership Cluster (5.7 GAed a few months back).
mysql/ps-5.7.17 compatible
tips to run successful cluster
TIP-1: All nodes of the cluster should have the same configuration settings (except
those settings that are h/w dependent or node-specific)
○ For example: wsrep_max_ws_size is used to control large transactions, gcs.fc_limit
(configured through wsrep_provider_options), wsrep_replicate_myisam (pxc-strict-mode
doesn't allow it any more), etc.
○ You should review the default settings and adjust them per your environment, but avoid changing
settings that you are not sure about. For example: wsrep_sync_wait helps read fresh data
but increases SELECT latency. It is recommended to set this only if there is a need for fresh
data.
tips to run successful cluster
TIP-2: Ensure timeouts are configured as per your environment.
○ The default timeouts set in PXC work in most cases (mainly for LAN setups), but each n/w setup is
different, so understand the latency, server configuration, and load the server needs to handle, and
accordingly set the timeouts.
○ PXC loves stable network connectivity, so try to install PXC on servers with stable network
connectivity.
○ GARBD may not need a high-end server, but it still needs good network
connectivity.
tips to run successful cluster
TIP-3: Monitor your PXC cluster regularly.
○ Monitoring is a must for any system's smooth operation, and PXC is not an exception.
○ Plan outages carefully; if you expect a sudden workload increase for a short time, make sure
enough resources are allocated for PXC to handle the increased workload (say,
wsrep_slave_threads).
○ Ensure that nodes don't keep leaving and rejoining the cluster without any known reason.
○ Don't boot multiple nodes to join through SST all of a sudden. One at a time. (A node is offered
membership first, and then the data transfer begins.)
tips to run successful cluster
TIP-4: Understand your workload
○ The workload of each user is different, and it is important to understand what kind of workload your
application is generating.
○ One such user observed that his cluster dragged at some point of the day, and then it took a good
while for the cluster to catch up. The user's application was firing an ALTER command on a large
table while DML was in progress, causing a complete stall of DML commands. On restoration,
the DML workload continued, and slaves failed to catch up due to insufficient resources (read:
slave threads).
○ Avoid use of experimental features (strict-mode will block them anyway).
○ Double-check the semantics while using commands like FTWRL, RSU, Events.
tips to run successful cluster
TIP-5: what to capture if you hit an issue
○ A clear description of what kind of setup you had (3-node cluster, 2 nodes + garbd, master
replicating to cluster, slave, etc.)
○ my.cnf + command-line options + show-status + any other configuration that you know you
have modified. (If they differ between nodes, capture them for all the nodes.)
○ Complete error.log from all of the nodes. (If it involves an SST failure, supporting error logs
are present in the data-dir (or hidden folder .sst).)
○ What the state of the cluster was while it was working fine, and a rough idea of the workload that was active
that pushed the cluster into the current state.
tips to run successful cluster
thanks for attending
● We would love to hear what more you want from PXC, or what problems you face in using PXC, and
would be happy to assist you or take a feature request and put it on the product roadmap.
● connect with us
○ forum: https://www.percona.com/forums/questions-discussions/percona-xtradb-cluster
○ email: [email protected], [email protected]
○ drop in at Percona Booth
○ catch us during conference
connect with us