Galera Replication Demystified: How Does It Work?

Preview:

Citation preview

Galera Replication Demystified

how does it work ?

 

 

1/262

about.me/lefred

Who am I ?

 

 

2/262

Frédéric Descamps@lefred

 

 

3/262

Frédéric Descamps@lefredWorking for Percona since 2011

 

 

4/262

Frédéric Descamps@lefredWorking for Percona since 2011Senior Architect

 

 

5/262

Frédéric Descamps@lefredWorking for Percona since 2011Senior ArchitectManaging MySQL since 3.23

 

 

6/262

Frédéric Descamps@lefredWorking for Percona since 2011Senior ArchitectManaging MySQL since 3.23devops believer

 

 

7/262

Frédéric Descamps@lefredWorking for Percona since 2011Senior ArchitectManaging MySQL since 3.23devops believerand I installed my first Galera Cluster in February 2010 ;-)

 

 

8/262

Galera Replication

Cluster

 

 

9/262

Galera Replication - Cluster

What is it ?

What does it handle ?

 

 

10/262

Standard asyncrhonous replication is server-centric, one serverstreams data to another one. All the nodes have a specific role.

 

 

11/262

In Galera, the dataset is synchronized between one or more servers:data-centric

 

 

12/262

You can write to any node in your cluster No need to worry abouteventual out-of-sync

 

 

13/262

Write events/transactions are sent in parallel

 

 

14/262

Cluster Membershipdetermined by the clusterwsrep_cluster_address is just a pointerany node is permitted to join that

knows the cluster namecan find a single active cluster node

 

 

15/262

Cluster Membershipdetermined by the clusterwsrep_cluster_address is just a pointerany node is permitted to join that

knows the cluster namecan find a single active cluster node

㫙�㫙�㫙�㫘�㫙�㫘�㫘�㫙�㫙�㫙�㫙�㫘�㫙�㫘�㫘�㫘�㫘�㫙�㫘�㫙�㫙�㫔�㫖�㫔�㫘�㫘�㫙�㫙�㫙�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�

 

 

16/262

The cluster manages quorum

and has split-brain protection.

 

 

17/262

 

 

18/262

 

 

19/262

 

 

20/262

 

 

21/262

 

 

22/262

 

 

23/262

Replication

 

 

24/262

ReplicationDelivers the writeset to all nodes in the cluster

 

 

25/262

ReplicationDelivers the writeset to all nodes in the cluster

and all nodes acknowledge the writeset

 

 

26/262

ReplicationDelivers the writeset to all nodes in the cluster

and all nodes acknowledge the writeset

 

 

27/262

ReplicationDelivers the writeset to all nodes in the cluster

and all nodes acknowledge the writesetCost is ~roundtrip latency to furthest node

 

 

28/262

ReplicationDelivers the writeset to all nodes in the cluster

and all nodes acknowledge the writesetCost is ~roundtrip latency to furthest nodeSerialized by Group Communication

 

 

29/262

GTID

 

 

30/262

GTIDNot the same as 5.6 Aynchronous GTID's

 

 

31/262

GTIDNot the same as 5.6 Aynchronous GTID's

though they appear the same

 

 

32/262

GTIDNot the same as 5.6 Aynchronous GTID's

though they appear the same939aac77-f7d1-11e3-bd5e-b211d6ab1ec6:1534285

 

 

33/262

GTIDNot the same as 5.6 Aynchronous GTID's

though they appear the same939aac77-f7d1-11e3-bd5e-b211d6ab1ec6:1534285

GTIDs ensure cluster members are consistent with each other

 

 

34/262

GTIDNot the same as 5.6 Aynchronous GTID's

though they appear the same939aac77-f7d1-11e3-bd5e-b211d6ab1ec6:1534285

GTIDs ensure cluster members are consistent with each othernodes joining a cluster have their GTIDs checked

 

 

35/262

GTIDNot the same as 5.6 Aynchronous GTID's

though they appear the same939aac77-f7d1-11e3-bd5e-b211d6ab1ec6:1534285

GTIDs ensure cluster members are consistent with each othernodes joining a cluster have their GTIDs checked

GTIDs can be used to compare downed nodes to each other

 

 

36/262

GTIDThe highest GTID is the most recently written

Generally the best practice it to bootstrap the node with the most recentdata

 

 

37/262

GTID

 

 

38/262

GTID

 

 

39/262

GTID

 

 

40/262

Global Transaction IDsinitial dataset

bfb912e5-f560-11e2-0800-1eefab05e57d:0

 

 

41/262

Global Transaction IDsinitial dataset

bfb912e5-f560-11e2-0800-1eefab05e57d:0

first change/transaction/writeset

bfb912e5-f560-11e2-0800-1eefab05e57d:1

 

 

42/262

Global Transaction IDsinitial dataset

bfb912e5-f560-11e2-0800-1eefab05e57d:0

first change/transaction/writeset

bfb912e5-f560-11e2-0800-1eefab05e57d:1

undefined GTID

00000000-0000-0000-0000-000000000000:-1

 

 

43/262

Global Transaction IDs : Galera vsMySQL 5.6

 

 

44/262

Global Transaction IDs : Galera vsMySQL 5.6

 

 

45/262

Global Transaction IDs : Galera vsMySQL 5.6

 

 

46/262

Global Transaction IDs : Galera vsMySQL 5.6

In MySQL 5.6㫕�㫘�㫕�㫕�㫕�㫘�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫘�㫕�㫕�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫘�㫕�㫘�㫘�㫕�㫕�㫕�㫕�㫘�㫕�㫕�㫕�㫘�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫘�㫕�㫕�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫘�㫕�㫘�㫘�㫕�㫕�㫕�㫕�㫘�㫕�㫕�㫕�㫘�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫘�㫕�㫕�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫘�㫕�㫘�㫘�㫕�㫕�㫕�㫘�㫔�㫙�㫘�㫙�㫔�㫙�㫘�㫙�㫙�㫘�㫙�㫔�㫙�㫙�㫙�㫙�㫙�㫙�㫘�㫘�㫔�㫘�㫕�㫘�㫕�㫕�㫘�㫘�㫕�㫕�㫕�㫕�㫕�㫘�㫘�㫕�㫕�㫕�㫘�㫕�㫕�㫕�㫘�㫕�㫕�㫕�㫘�㫕�㫕�㫘�㫘�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫘�㫕�㫕�㫘�㫘�㫕�㫕�㫕�㫕�㫕�㫘�㫘�㫕�㫕�㫕�㫘�㫕�㫕�㫕�㫘�㫕�㫕�㫕�㫘�㫕�㫕�㫘�㫘�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫘�㫕�㫕�㫘�㫘�㫕�㫕�㫕�㫕�㫕�㫘�㫘�㫕�㫕�㫕�㫘�㫕�㫕�㫕�㫘�㫕�㫕�㫕�㫘�㫕�㫕�㫘�㫘�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�

 

 

47/262

Global Transaction IDs : Galera vsMySQL 5.6

In Galera㫘�㫘�㫘�㫕�㫕�㫕�㫘�㫕�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫘�㫘�㫘�㫘�㫘�㫕�㫕�㫘�㫕�㫕�㫘�㫘�㫕�㫕�㫕�㫕�㫕�㫘�㫘�㫘�㫕�㫕�㫕�㫘�㫕�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫘�㫘�㫘�㫘�㫘�㫕�㫕�㫘�㫕�㫕�㫘�㫘�㫕�㫕�㫕�㫕�㫕�㫘�㫘�㫘�㫕�㫕�㫕�㫘�㫕�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫘�㫘�㫘�㫘�㫘�㫕�㫕�㫘�㫕�㫕�㫘�㫘�㫕�㫕�㫕�㫕�㫕�

 

 

48/262

Global Transaction IDs : Galera vsMySQL 5.6

In Galera㫘�㫘�㫘�㫕�㫕�㫕�㫘�㫕�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫘�㫘�㫘�㫘�㫘�㫕�㫕�㫘�㫕�㫕�㫘�㫘�㫕�㫕�㫕�㫕�㫕�㫘�㫘�㫘�㫕�㫕�㫕�㫘�㫕�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫘�㫘�㫘�㫘�㫘�㫕�㫕�㫘�㫕�㫕�㫘�㫘�㫕�㫕�㫕�㫕�㫕�㫘�㫘�㫘�㫕�㫕�㫕�㫘�㫕�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫘�㫘�㫘�㫘�㫘�㫕�㫕�㫘�㫕�㫕�㫘�㫘�㫕�㫕�㫕�㫕�㫕�㫘�㫔�㫙�㫘�㫙�㫔�㫙�㫘�㫙�㫙�㫘�㫙�㫔�㫙�㫙�㫙�㫙�㫙�㫙�㫘�㫘�㫔�㫘�

 

 

49/262

Global Transaction IDs : Galera vsMySQL 5.6

In Galera㫘�㫘�㫘�㫕�㫕�㫕�㫘�㫕�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫘�㫘�㫘�㫘�㫘�㫕�㫕�㫘�㫕�㫕�㫘�㫘�㫕�㫕�㫕�㫕�㫕�㫘�㫘�㫘�㫕�㫕�㫕�㫘�㫕�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫘�㫘�㫘�㫘�㫘�㫕�㫕�㫘�㫕�㫕�㫘�㫘�㫕�㫕�㫕�㫕�㫕�㫘�㫘�㫘�㫕�㫕�㫕�㫘�㫕�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫘�㫘�㫘�㫘�㫘�㫕�㫕�㫘�㫕�㫕�㫘�㫘�㫕�㫕�㫕�㫕�㫕�㫘�㫔�㫙�㫘�㫙�㫔�㫙�㫘�㫙�㫙�㫘�㫙�㫔�㫙�㫙�㫙�㫙�㫙�㫙�㫘�㫘�㫔�㫘�㫘�㫘�㫘�㫕�㫕�㫕�㫘�㫕�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫘�㫘�㫘�㫘�㫘�㫕�㫕�㫘�㫕�㫕�㫘�㫘�㫕�㫕�㫕�㫕�㫕�㫘�㫘�㫘�㫕�㫕�㫕�㫘�㫕�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫘�㫘�㫘�㫘�㫘�㫕�㫕�㫘�㫕�㫕�㫘�㫘�㫕�㫕�㫕�㫕�㫕�㫘�㫘�㫘�㫕�㫕�㫕�㫘�㫕�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫘�㫘�㫘�㫘�㫘�㫕�㫕�㫘�㫕�㫕�㫘�㫘�㫕�㫕�㫕�㫕�㫕�

 

 

50/262

GTID Assignment

UUID

The UUID section, 128-bit, is generated during bootstrapping to identifythe cluster.

Generated by mixing the timer value and pseudo-random numbers(depending primarily from the timer and PID), but, currently, there is nouse of NIC's MAC-address although in theory it may be done that way.

More info: https://gist.github.com/lefred/88a2cec88d03854d9934

 

 

51/262

GTID Assignment (3)

SEQNO

The seqno, 64-bit, is incremented only when the transaction passescertification and is ready for commit.

For the curious, the algorithm used to derive sequence number is TotemSingle-ring Ordering protocol.

Before that, there is already communication between the nodes, groupcommunication is used to define a group-channel id which is a locallymaintained counter by each node in sync with the group.

 

 

52/262

Serialization of writesets

 

 

53/262

Serialization of writesets

 

 

54/262

Serialization of writesets

 

 

55/262

Serialization of writesets

 

 

56/262

Serialization of writesets

 

 

57/262

Serialization of writesets

 

 

58/262

Serialization of writesets

 

 

59/262

Serialization of writesets

 

 

60/262

Serialization of writesets

 

 

61/262

Serialization of writesets

 

 

62/262

Serialization of writesets

 

 

63/262

Serialization of writesets

 

 

64/262

Serialization of writesets

 

 

65/262

Serialization of writesets

 

 

66/262

RolesWe have 4 distinct roles in Galera:

2 for replication2 for state transfer

 

 

67/262

Replication RolesWithin the cluster, all nodes are equal

 

 

68/262

Replication RolesWithin the cluster, all nodes are equal'master/donor node'

 

 

69/262

Replication RolesWithin the cluster, all nodes are equal'master/donor node'

The node a given transaction was written and committed on.

 

 

70/262

Replication RolesWithin the cluster, all nodes are equal'master/donor node'

The node a given transaction was written and committed on.'slave/joiner node'

 

 

71/262

Replication RolesWithin the cluster, all nodes are equal'master/donor node'

The node a given transaction was written and committed on.'slave/joiner node'

The node that received the given transaction via Galerareplication.

 

 

72/262

Replication RolesWithin the cluster, all nodes are equal'master/donor node'

The node a given transaction was written and committed on.'slave/joiner node'

The node that received the given transaction via Galerareplication.

The terms master and slave in this context are only relevant for agiven transaction

 

 

73/262

Replication RolesWithin the cluster, all nodes are equal'master/donor node'

The node a given transaction was written and committed on.'slave/joiner node'

The node that received the given transaction via Galerareplication.

The terms master and slave in this context are only relevant for agiven transactionWriteset: Galera’s term for a transaction. One or more RBR rowchanges

 

 

74/262

State Transfer RolesNew nodes joining an existing cluster get provisioned automatically

 

 

75/262

State Transfer RolesNew nodes joining an existing cluster get provisioned automatically

Joiner = Mew node

 

 

76/262

State Transfer RolesNew nodes joining an existing cluster get provisioned automatically

Joiner = Mew nodeDonor = Node giving a copy of the datadir

 

 

77/262

State Transfer RolesNew nodes joining an existing cluster get provisioned automatically

Joiner = Mew nodeDonor = Node giving a copy of the datadir

State Snapshot transfer

 

 

78/262

State Transfer RolesNew nodes joining an existing cluster get provisioned automatically

Joiner = Mew nodeDonor = Node giving a copy of the datadir

State Snapshot transferFull backup of Donor to Joiner

 

 

79/262

State Transfer RolesNew nodes joining an existing cluster get provisioned automatically

Joiner = Mew nodeDonor = Node giving a copy of the datadir

State Snapshot transferFull backup of Donor to Joiner

Incremental Snapshot transfer

 

 

80/262

State Transfer RolesNew nodes joining an existing cluster get provisioned automatically

Joiner = Mew nodeDonor = Node giving a copy of the datadir

State Snapshot transferFull backup of Donor to Joiner

Incremental Snapshot transferOnly changes since node left cluster

 

 

81/262

WritesetRBR payload (black box to Galera)

 

 

82/262

WritesetRBR payload (black box to Galera)Replication keys (generated by master/donor node)

 

 

83/262

WritesetRBR payload (black box to Galera)Replication keys (generated by master/donor node)

Primary keys

 

 

84/262

WritesetRBR payload (black box to Galera)Replication keys (generated by master/donor node)

Primary keysUnique keys

 

 

85/262

WritesetRBR payload (black box to Galera)Replication keys (generated by master/donor node)

Primary keysUnique keysForeign Keys

 

 

86/262

WritesetRBR payload (black box to Galera)Replication keys (generated by master/donor node)

Primary keysUnique keysForeign KeysTable names

 

 

87/262

WritesetRBR payload (black box to Galera)Replication keys (generated by master/donor node)

Primary keysUnique keysForeign KeysTable namesSchema names

 

 

88/262

WritesetRBR payload (black box to Galera)Replication keys (generated by master/donor node)

Primary keysUnique keysForeign KeysTable namesSchema names

Keys are what make certification possible

 

 

89/262

Replication ?It consists in 4 operations:

ApplyReplicationCertificationCommit

The order differs with the node's role

 

 

90/262

Replication Order on Master/Donor1. Apply

2. Replication3. Certification4. Commit

 

 

91/262

Replication Order on Slave/Joiner1. Replication (from master/donor)

2. Certification3. Apply4. Commit

 

 

92/262

CertificationCan this writeset be applied?

 

 

93/262

CertificationCan this writeset be applied?

Based on unapplied earlier transactions on master/donor

 

 

94/262

CertificationCan this writeset be applied?

Based on unapplied earlier transactions on master/donorSuch conflicts must come from other nodes

 

 

95/262

CertificationCan this writeset be applied?

Based on unapplied earlier transactions on master/donorSuch conflicts must come from other nodes

Happens on every node

 

 

96/262

CertificationCan this writeset be applied?

Based on unapplied earlier transactions on master/donorSuch conflicts must come from other nodes

Happens on every nodeShould be deterministic on every node

 

 

97/262

CertificationCan this writeset be applied?

Based on unapplied earlier transactions on master/donorSuch conflicts must come from other nodes

Happens on every nodeShould be deterministic on every nodeResults are not reported to the cluster

 

 

98/262

CertificationCan this writeset be applied?

Based on unapplied earlier transactions on master/donorSuch conflicts must come from other nodes

Happens on every nodeShould be deterministic on every nodeResults are not reported to the cluster

Pass: enter apply queue (commit success on master/donor)

 

 

99/262

CertificationCan this writeset be applied?

Based on unapplied earlier transactions on master/donorSuch conflicts must come from other nodes

Happens on every nodeShould be deterministic on every nodeResults are not reported to the cluster

Pass: enter apply queue (commit success on master/donor)Fail: drop transaction (or return deadlock on master/donor)

 

 

100/262

CertificationCan this writeset be applied?

Based on unapplied earlier transactions on master/donorSuch conflicts must come from other nodes

Happens on every nodeShould be deterministic on every nodeResults are not reported to the cluster

Pass: enter apply queue (commit success on master/donor)Fail: drop transaction (or return deadlock on master/donor)

Serialized by group communication sequence (and GTID will besynchronized following the same sequence)

 

 

101/262

CertificationCan this writeset be applied?

Based on unapplied earlier transactions on master/donorSuch conflicts must come from other nodes

Happens on every nodeShould be deterministic on every nodeResults are not reported to the cluster

Pass: enter apply queue (commit success on master/donor)Fail: drop transaction (or return deadlock on master/donor)

Serialized by group communication sequence (and GTID will besynchronized following the same sequence)Cost based on # of keys or # of rows

 

 

102/262

ApplyApply is done on slave nodes after certification

 

 

103/262

ApplyApply is done on slave nodes after certificationCan be parallelized

 

 

104/262

ApplyApply is done on slave nodes after certificationCan be parallelized

if wsrep_slave_threads > 1

 

 

105/262

ApplyApply is done on slave nodes after certificationCan be parallelized

if wsrep_slave_threads > 1if there are no other writesets with conflicting keys also beingapplied

 

 

106/262

ApplyApply is done on slave nodes after certificationCan be parallelized

if wsrep_slave_threads > 1if there are no other writesets with conflicting keys also beingapplied

Cost: size of transaction

 

 

107/262

ApplyApply is done on slave nodes after certificationCan be parallelized

if wsrep_slave_threads > 1if there are no other writesets with conflicting keys also beingapplied

Cost: size of transactionGenerates brute force aborts on local node for conflicts

 

 

108/262

CommitFinal local InnoDB commit

 

 

109/262

CommitFinal local InnoDB commit

i.e., innodb_flush_log_at_trx_commit

 

 

110/262

CommitFinal local InnoDB commit

i.e., innodb_flush_log_at_trx_commitGTID gets generated

 

 

111/262

CommitFinal local InnoDB commit

i.e., innodb_flush_log_at_trx_commitGTID gets generatedDone by applier threads on slaves/joiners

 

 

112/262

CommitFinal local InnoDB commit

i.e., innodb_flush_log_at_trx_commitGTID gets generatedDone by applier threads on slaves/joinersDone by client thread on master/donor

 

 

113/262

CommitFinal local InnoDB commit

i.e., innodb_flush_log_at_trx_commitGTID gets generatedDone by applier threads on slaves/joinersDone by client thread on master/donorinnodb_flush_log_at_trx_commit=1 not required generally for PXC!

 

 

114/262

Galera Replication (autocommit)

 

 

115/262

Galera Replication (autocommit)

 

 

116/262

Galera Replication (autocommit)

 

 

117/262

Galera Replication (autocommit)

 

 

118/262

Galera Replication (autocommit)

 

 

119/262

Galera Replication (autocommit)

 

 

120/262

Galera Replication (autocommit)

 

 

121/262

Galera Replication (full transaction)

 

 

122/262

Galera Replication (full transaction)

 

 

123/262

Galera Replication (full transaction)

 

 

124/262

Galera Replication (full transaction)

 

 

125/262

Galera Replication (full transaction)

 

 

126/262

Galera Replication (full transaction)

 

 

127/262

Galera Replication (full transaction)

 

 

128/262

Galera Replication (full transaction)

 

 

129/262

Galera Replication (full transaction)

 

 

130/262

Optimistic LockingTraditional locking

 

 

131/262

Optimistic LockingTraditional locking

 

 

132/262

Optimistic LockingTraditional locking

 

 

133/262

Optimistic LockingTraditional locking

 

 

134/262

Optimistic LockingTraditional locking

 

 

135/262

Optimistic LockingTraditional locking

 

 

136/262

Optimistic LockingOptimistic Locking

 

 

137/262

Optimistic LockingOptimistic Locking

 

 

138/262

Optimistic LockingOptimistic Locking

 

 

139/262

Optimistic LockingOptimistic Locking

 

 

140/262

Optimistic LockingOptimistic Locking

 

 

141/262

Optimistic LockingOptimistic Locking

 

 

142/262

Optimistic LockingOptimistic Locking

 

 

143/262

Certification Failure

 

 

144/262

Certification Failure

Trx1 is open on 㫗�㫙�㫘�㫘�㫕�Trx2 is open on 㫗�㫙�㫘�㫘�㫕�

 

 

145/262

Certification Failure

㫗�㫙�㫘�㫘�㫕� gets 㫖�㫗�㫗�㫗�㫗�㫗�

 

 

146/262

Certification Failure

Synchronous replication

 

 

147/262

Certification Failure

Certification testsrun in isolation on each node

 

 

148/262

Certification Failure

Certification tests:asynchronous

 

 

149/262

Certification Failure

Synchronous replicationdeterministic

 

 

150/262

Certification Failure

Certification succeeds

 

 

151/262

Certification Failure

Certified transaction goes to the apply queue

 

 

152/262

Certification Failure

 

 

153/262

Certification Failure

On 㫗�㫙�㫘�㫘�㫕�, a successful cert test, means an actual 㫘�㫙�㫙�㫙�㫘�㫙�

 

 

154/262

Certification Failure

and transactions in the apply queue (㫗�㫙�㫘�㫘�㫕� & 㫗�㫙�㫘�㫘�㫕�) are executedasynchronously

 

 

155/262

Certification Failure

On 㫙�㫙�㫘�㫘�㫕� we commit the transaction

 

 

156/262

Certification Failure

Synchronous replication

 

 

157/262

Certification Failure

 

 

158/262

Certification Failure

 

 

159/262

Certification Failure

 

 

160/262

Certification Failure

 

 

161/262

Certification Failure

 

 

162/262

Certification Failure

 

 

163/262

Certification Failure

 

 

164/262

Brute Force Abort (bfa)

 

 

165/262

Brute Force Abort (bfa)

 

 

166/262

Brute Force Abort (bfa)

 

 

167/262

Brute Force Abort (bfa)

 

 

168/262

Brute Force Abort (bfa)

 

 

169/262

Brute Force Abort (bfa)

 

 

170/262

Brute Force Abort (bfa)

 

 

171/262

Local Certification Failure (lcf)

 

 

172/262

Local Certification Failure (lcf)

 

 

173/262

Local Certification Failure (lcf)

 

 

174/262

Local Certification Failure (lcf)

 

 

175/262

Local Certification Failure (lcf)

 

 

176/262

Local Certification Failure (lcf)

 

 

177/262

Local Certification Failure (lcf)

 

 

178/262

Local Certification Failure (lcf)

 

 

179/262

Local Certification Failure (lcf)

 

 

180/262

Local Certification Failure (lcf)

 

 

181/262

Local Certification Failure (lcf)

 

 

182/262

Local Certification Failure (lcf)

 

 

183/262

Certification Errors: summary

 

 

184/262

Certification Errors: summary

 

 

185/262

Certification Errors: summary

 

 

186/262

Flow ControlAbility of any node in the cluster to ask the rest of the nodes topause writes while it catches up

 

 

187/262

Flow ControlAbility of any node in the cluster to ask the rest of the nodes topause writes while it catches upFeedback mechanism for replication process

 

 

188/262

Flow ControlAbility of any node in the cluster to ask the rest of the nodes topause writes while it catches upFeedback mechanism for replication processONLY caused by 㫙�㫙�㫙�㫘�㫙�㫘�㫙�㫙�㫘�㫘�㫙�㫘�㫙�㫘�㫘�㫙�㫘�㫙�㫙�㫘�㫙�㫘� exceeding a node's㫘�㫘�㫘�㫙�㫘�㫙�㫘�㫙�

 

 

189/262

Flow ControlAbility of any node in the cluster to ask the rest of the nodes topause writes while it catches upFeedback mechanism for replication processONLY caused by 㫙�㫙�㫙�㫘�㫙�㫘�㫙�㫙�㫘�㫘�㫙�㫘�㫙�㫘�㫘�㫙�㫘�㫙�㫙�㫘�㫙�㫘� exceeding a node's㫘�㫘�㫘�㫙�㫘�㫙�㫘�㫙�CAN pause the entire cluster and look like a cluster stall !

 

 

190/262

Tuning Flow ControlToo low:

 

 

191/262

Tuning Flow ControlToo low:

frequent FC from any and all nodes in the cluster

 

 

192/262

Tuning Flow ControlToo low:

frequent FC from any and all nodes in the clusterToo high:

 

 

193/262

Tuning Flow ControlToo low:

frequent FC from any and all nodes in the clusterToo high:

increase in replication conflicts in multi-node writings

 

 

194/262

Tuning Flow ControlToo low:

frequent FC from any and all nodes in the clusterToo high:

increase in replication conflicts in multi-node writingsincrease in apply lag on nodes

 

 

195/262

Tuning Flow ControlToo low:

frequent FC from any and all nodes in the clusterToo high:

increase in replication conflicts in multi-node writingsincrease in apply lag on nodesincrease in commit lag on nodes

 

 

196/262

Tuning Flow ControlToo low:

frequent FC from any and all nodes in the clusterToo high:

increase in replication conflicts in multi-node writingsincrease in apply lag on nodesincrease in commit lag on nodes

One node with FC issues

 

 

197/262

Tuning Flow ControlToo low:

frequent FC from any and all nodes in the clusterToo high:

increase in replication conflicts in multi-node writingsincrease in apply lag on nodesincrease in commit lag on nodes

One node with FC issuesdeal with that node -bad hardware? too slow ?

 

 

198/262

Flow Control

 

 

199/262

Flow Control

 

 

200/262

Flow Control

 

 

201/262

Flow Control

 

 

202/262

Flow Control

 

 

203/262

Flow Control

 

 

204/262

Flow Control

 

 

205/262

Flow Control

 

 

206/262

Flow Control

 

 

207/262

Flow Control

 

 

208/262

Flow Control

 

 

209/262

Flow Control

 

 

210/262

Flow Control

 

 

211/262

Flow Control

 

 

212/262

Flow Control

 

 

213/262

Flow Control

 

 

214/262

Flow Control

 

 

215/262

Flow Control

 

 

216/262

Flow Control

 

 

217/262

Flow Control

 

 

218/262

Flow Control

 

 

219/262

State TransferThere are two types of State Transfer in Galera:

 

 

220/262

State TransferThere are two types of State Transfer in Galera:

1. SST (Snapshot State Transfer): full data copy

 

 

221/262

State TransferThere are two types of State Transfer in Galera:

1. SST (Snapshot State Transfer): full data copyrsyncmysqldumpxtrabackup

 

 

222/262

State TransferThere are two types of State Transfer in Galera:

1. SST (Snapshot State Transfer): full data copyrsyncmysqldumpxtrabackup

2. IST (Incremental State Transfer): only copy the missing events

 

 

223/262

State TransferThere are two types of State Transfer in Galera:

1. SST (Snapshot State Transfer): full data copyrsyncmysqldumpxtrabackup

2. IST (Incremental State Transfer): only copy the missing events

It's always better to try to avoid SST!

 

 

224/262

State TransferThere are two types of State Transfer in Galera:

1. SST (Snapshot State Transfer): full data copyrsyncmysqldumpxtrabackup

2. IST (Incremental State Transfer): only copy the missing events

It's always better to try to avoid SST!

㫙�㫙�㫙�㫘�㫙�㫘�㫙�㫙�㫙�㫘�㫘�㫙�㫙�㫙�㫙� can be used to specify the donor

 

 

225/262

State Transfer: grastate.datWhen MySQL starts it checks 㫘�㫙�㫘�㫙�㫙�㫘�㫙�㫘�㫕�㫘�㫘�㫙� file (in datadir)

 

 

226/262

State Transfer: grastate.datWhen MySQL starts it checks 㫘�㫙�㫘�㫙�㫙�㫘�㫙�㫘�㫕�㫘�㫘�㫙� file (in datadir)This file placeholds GTID between MySQL restarts

 

 

227/262

State Transfer: grastate.datWhen MySQL starts it checks 㫘�㫙�㫘�㫙�㫙�㫘�㫙�㫘�㫕�㫘�㫘�㫙� file (in datadir)This file placeholds GTID between MySQL restartsContains UUID and Seqno

 

 

228/262

State Transfer: grastate.datWhen MySQL starts it checks 㫘�㫙�㫘�㫙�㫙�㫘�㫙�㫘�㫕�㫘�㫘�㫙� file (in datadir)This file placeholds GTID between MySQL restartsContains UUID and Seqno

㫔�㫔�㫗�㫖�㫗�㫖�㫗�㫖�㫔�㫙�㫘�㫙�㫘�㫘�㫔�㫙�㫙�㫘�㫙�㫘�㫙�㫘�㫙�㫙�㫘�㫙�㫙�㫕�㫔�㫕�㫕�㫕�㫙�㫙�㫘�㫘�㫕�㫔�㫔�㫔�㫔�㫘�㫘�㫘�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫘�㫘�㫘�㫕�㫕�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫘�㫕�㫘�㫕�㫕�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫘�㫕�㫙�㫘�㫙�㫙�㫙�㫕�㫔�㫔�㫔�㫕�㫕�㫕�㫕�㫕�㫘�㫘�㫙�㫙�㫘�㫘�㫙�㫘�㫘�㫙�㫕�

 

 

229/262

State Transfer: grastate.datWhen MySQL starts it checks 㫘�㫙�㫘�㫙�㫙�㫘�㫙�㫘�㫕�㫘�㫘�㫙� file (in datadir)This file placeholds GTID between MySQL restartsContains UUID and Seqno

㫔�㫔�㫗�㫖�㫗�㫖�㫗�㫖�㫔�㫙�㫘�㫙�㫘�㫘�㫔�㫙�㫙�㫘�㫙�㫘�㫙�㫘�㫙�㫙�㫘�㫙�㫙�㫕�㫔�㫕�㫕�㫕�㫙�㫙�㫘�㫘�㫕�㫔�㫔�㫔�㫔�㫘�㫘�㫘�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫘�㫘�㫘�㫕�㫕�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫘�㫕�㫘�㫕�㫕�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫘�㫕�㫙�㫘�㫙�㫙�㫙�㫕�㫔�㫔�㫔�㫕�㫕�㫕�㫕�㫕�㫘�㫘�㫙�㫙�㫘�㫘�㫙�㫘�㫘�㫙�㫕�

node shutdown cleanly

 

 

230/262

grastate.dat - example㫔�㫔�㫗�㫖�㫗�㫖�㫗�㫖�㫔�㫙�㫘�㫙�㫘�㫘�㫔�㫙�㫙�㫘�㫙�㫘�㫙�㫘�㫙�㫙�㫘�㫙�㫙�㫕�㫔�㫕�㫕�㫕�㫙�㫙�㫘�㫘�㫕�㫔�㫔�㫔�㫔�㫘�㫘�㫘�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫘�㫘�㫘�㫕�㫕�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫘�㫕�㫘�㫕�㫕�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫘�㫕�㫙�㫘�㫙�㫙�㫙�㫕�㫔�㫔�㫔�㫕�㫕�㫘�㫘�㫙�㫙�㫘�㫘�㫙�㫘�㫘�㫙�㫕�

 

 

231/262

grastate.dat - example㫔�㫔�㫗�㫖�㫗�㫖�㫗�㫖�㫔�㫙�㫘�㫙�㫘�㫘�㫔�㫙�㫙�㫘�㫙�㫘�㫙�㫘�㫙�㫙�㫘�㫙�㫙�㫕�㫔�㫕�㫕�㫕�㫙�㫙�㫘�㫘�㫕�㫔�㫔�㫔�㫔�㫘�㫘�㫘�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫘�㫘�㫘�㫕�㫕�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫘�㫕�㫘�㫕�㫕�㫕�㫘�㫕�㫕�㫕�㫕�㫕�㫘�㫕�㫙�㫘�㫙�㫙�㫙�㫕�㫔�㫔�㫔�㫕�㫕�㫘�㫘�㫙�㫙�㫘�㫘�㫙�㫘�㫘�㫙�㫕�

node is running or did not shutdown cleanly

 

 

232/262

grastate.dat - example㫔�㫔�㫗�㫖�㫗�㫖�㫗�㫖�㫔�㫙�㫘�㫙�㫘�㫘�㫔�㫙�㫙�㫘�㫙�㫘�㫙�㫘�㫙�㫙�㫘�㫙�㫙�㫕�㫔�㫕�㫕�㫕�㫙�㫙�㫘�㫘�㫕�㫔�㫔�㫔�㫔�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫔�㫔�㫔�㫔�㫔�㫔�㫔�㫔�㫔�㫙�㫘�㫙�㫙�㫙�㫕�㫔�㫔�㫔�㫕�㫕�㫘�㫘�㫙�㫙�㫘�㫘�㫙�㫘�㫘�㫙�㫕�

 

 

233/262

grastate.dat - example㫔�㫔�㫗�㫖�㫗�㫖�㫗�㫖�㫔�㫙�㫘�㫙�㫘�㫘�㫔�㫙�㫙�㫘�㫙�㫘�㫙�㫘�㫙�㫙�㫘�㫙�㫙�㫕�㫔�㫕�㫕�㫕�㫙�㫙�㫘�㫘�㫕�㫔�㫔�㫔�㫔�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫕�㫔�㫔�㫔�㫔�㫔�㫔�㫔�㫔�㫔�㫙�㫘�㫙�㫙�㫙�㫕�㫔�㫔�㫔�㫕�㫕�㫘�㫘�㫙�㫙�㫘�㫘�㫙�㫘�㫘�㫙�㫕�

node aborted, SST on next restart

 

 

234/262

State Transfer: ISTWhen a node starts, it knowns the UUID of the cluster it belonged andthe last sequence number it applied.

So, it sends that position to the other members of the cluster and if anode can send the next events (ws/trx), IST will be performed, if none,then SST will be triggered.

 

 

235/262

Galera CacheThose events are stored on the 㫘�㫘�㫙�㫘�㫙�㫘�㫕�㫘�㫘�㫘�㫘�㫘� file.

 

 

236/262

Galera CacheThose events are stored on the 㫘�㫘�㫙�㫘�㫙�㫘�㫕�㫘�㫘�㫘�㫘�㫘� file.

preallocated file with a specific size

 

 

237/262

Galera CacheThose events are stored on the 㫘�㫘�㫙�㫘�㫙�㫘�㫕�㫘�㫘�㫘�㫘�㫘� file.

preallocated file with a specific sizeused to store the writesets in circular buffer style

 

 

238/262

Galera CacheThose events are stored on the 㫘�㫘�㫙�㫘�㫙�㫘�㫕�㫘�㫘�㫘�㫘�㫘� file.

preallocated file with a specific sizeused to store the writesets in circular buffer styledefault size is 128M

 

 

239/262

Galera CacheThose events are stored on the 㫘�㫘�㫙�㫘�㫙�㫘�㫕�㫘�㫘�㫘�㫘�㫘� file.

preallocated file with a specific sizeused to store the writesets in circular buffer styledefault size is 128Mcan be increase via provider option 㫘�㫘�㫘�㫘�㫘�㫘�㫕�㫙�㫘�㫚�㫘�

 

 

240/262

Galera CacheThose events are stored on the 㫘�㫘�㫙�㫘�㫙�㫘�㫕�㫘�㫘�㫘�㫘�㫘� file.

preallocated file with a specific sizeused to store the writesets in circular buffer styledefault size is 128Mcan be increase via provider option 㫘�㫘�㫘�㫘�㫘�㫘�㫕�㫙�㫘�㫚�㫘�

㫙�㫙�㫙�㫘�㫙�㫘�㫙�㫙�㫙�㫙�㫘�㫘�㫘�㫙�㫘�㫙�㫙�㫙�㫘�㫙�㫙�㫙�㫔�㫖�㫔�㫔�㫘�㫘�㫘�㫘�㫘�㫘�㫕�㫙�㫘�㫚�㫘�㫖�㫕�㫗�㫔�

 

 

241/262

Galera CacheThose events are stored on the 㫘�㫘�㫙�㫘�㫙�㫘�㫕�㫘�㫘�㫘�㫘�㫘� file.

preallocated file with a specific sizeused to store the writesets in circular buffer styledefault size is 128Mcan be increase via provider option 㫘�㫘�㫘�㫘�㫘�㫘�㫕�㫙�㫘�㫚�㫘�

㫙�㫙�㫙�㫘�㫙�㫘�㫙�㫙�㫙�㫙�㫘�㫘�㫘�㫙�㫘�㫙�㫙�㫙�㫘�㫙�㫙�㫙�㫔�㫖�㫔�㫔�㫘�㫘�㫘�㫘�㫘�㫘�㫕�㫙�㫘�㫚�㫘�㫖�㫕�㫗�㫔�Galare Cache is mmaped (I/O buffered to memory)

 

 

242/262

Galera CacheThose events are stored on the 㫘�㫘�㫙�㫘�㫙�㫘�㫕�㫘�㫘�㫘�㫘�㫘� file.

preallocated file with a specific sizeused to store the writesets in circular buffer styledefault size is 128Mcan be increase via provider option 㫘�㫘�㫘�㫘�㫘�㫘�㫕�㫙�㫘�㫚�㫘�

㫙�㫙�㫙�㫘�㫙�㫘�㫙�㫙�㫙�㫙�㫘�㫘�㫘�㫙�㫘�㫙�㫙�㫙�㫘�㫙�㫙�㫙�㫔�㫖�㫔�㫔�㫘�㫘�㫘�㫘�㫘�㫘�㫕�㫙�㫘�㫚�㫘�㫖�㫕�㫗�㫔�Galare Cache is mmaped (I/O buffered to memory)㫙�㫙�㫙�㫘�㫙�㫘�㫙�㫙�㫘�㫘�㫙�㫘�㫘�㫘�㫘�㫘�㫘�㫘�㫘�㫘�㫙�㫙�㫙�㫙�㫙� provide the first seqno present inthe cache for that node

 

 

243/262

Galera Cache & IST

 

 

244/262

Galera Cache & IST

 

 

245/262

Galera Cache & IST

 

 

246/262

Galera Cache & IST

 

 

247/262

Galera Cache & IST

 

 

248/262

Galera Cache & IST

 

 

249/262

Galera Cache & IST

 

 

250/262

Galera Cache & IST

 

 

251/262

Galera Cache & IST

 

 

252/262

Galera Cache & IST

 

 

253/262

Galera Cache & IST

 

 

254/262

Galera Cache & IST

 

 

255/262

Galera Cache & IST

 

 

256/262

Galera Cache & IST

 

 

257/262

Galera Cache & IST

 

 

258/262

Galera Cache & IST

 

 

259/262

Galera Cache & IST

 

 

260/262

Galera Cache & IST

 

 

261/262

Thank you !

 

 

262/262

Recommended