Monitoring with Pacemaker - Home | NETWAYS GmbH

Monitoring with Pacemaker Martin G. Loschwitz

Who?

General Disclaimer

What I am not saying ...

Pacemaker does NOT replace monitoring

WTF?

The only thing better than monitoring ...

... is moar monitoring!

PM does not replace monitoring ...

... but it adds another layer to it.

Now, imagine a typical 2-node cluster

DRBD

DRBD, the DRBD logo and LINBIT are registered trademarks of LINBIT Information Technologies GmbH. hastexo is not affiliated with the trademark owner.

And assume that MySQL crashes

Icinga will notice. Eventually.

Somebody will be alarmed.

And fix the problem soon. Hopefully.

5 minutes of downtime

An availability of 99.997% equals ...

... roughly 15 minutes of downtime per year
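As a rough check of that figure, the arithmetic can be sketched in shell. The function name `downtime_minutes_per_year` is made up for this illustration, and a 365-day year is assumed.

```shell
#!/bin/sh
# Convert an availability percentage into the downtime budget per year.
# Assumes a 365-day year (525,600 minutes); the function name is hypothetical.
downtime_minutes_per_year() {
    awk -v a="$1" 'BEGIN { printf "%.1f\n", (100 - a) / 100 * 365 * 24 * 60 }'
}
```

Calling `downtime_minutes_per_year 99.997` gives a budget of a little under 16 minutes, so a single 5-minute outage already uses about a third of the year's allowance.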

Applies to other software, too

Delay is bad

Can we automate this?

Yes we can!

Nagios / Icinga hooks?

Better: Pacemaker!

Pacemaker can ...

... restart failed resources

... move failed resources

... stop failed resources

... kill nodes acting up

Sounds good, eh?

"So you want Pacemaker monitoring"

Use Pacemaker

Timeline (1999, 2005, 2009) showing HB 1, HB 2 and PM

Monitoring in Pacemaker

Node monitoring

"Fail-Over"

Node health flags

node alice
node bob
property no-quorum-policy="ignore" \
    stonith-enabled="false" \
    node-health-strategy="migrate-on-red"

alice:~$ crm_attribute --node alice --name '#health-temp' --update 'red' --lifetime reboot

Helpful for SMART checking

Helpful for temp checking

Helpful for fan checking
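A check script that feeds such a health flag might look roughly like the sketch below. The thresholds (70 and 85 degrees Celsius), the function names and the `CRM_ATTRIBUTE` override are assumptions made for illustration. Only the `crm_attribute` invocation itself is taken from the slide above, and how the temperature is read (sensors, SMART, IPMI) is left out.

```shell
#!/bin/sh
# Map a temperature reading (whole degrees Celsius) to a node health colour.
# The thresholds are hypothetical; pick values that suit the hardware.
health_color() {
    temp_c=$1
    if [ "$temp_c" -ge 85 ]; then
        echo red
    elif [ "$temp_c" -ge 70 ]; then
        echo yellow
    else
        echo green
    fi
}

# Publish the colour as the '#health-temp' node attribute.
# CRM_ATTRIBUTE is a hypothetical override so the call can be dry-run,
# for example with CRM_ATTRIBUTE=echo.
set_health_temp() {
    node=$1
    color=$2
    ${CRM_ATTRIBUTE:-crm_attribute} --node "$node" --name '#health-temp' \
        --update "$color" --lifetime reboot
}
```

Run from cron or a monitoring hook, `set_health_temp alice "$(health_color 91)"` would flag alice as red, and with node-health-strategy="migrate-on-red" the cluster would then move resources away from it.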

Node Bias

migration-threshold

MySQL

migration threshold

primitive p_mysql ocf:heartbeat:mysql \
    […]
    meta migration-threshold="3"

alice:~$ crm resource cleanup p_mysql
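The effect of migration-threshold can be modelled in a few lines. This is a deliberately simplified sketch, not Pacemaker's actual policy engine, which also weighs stickiness, constraints and node scores. The function name `next_action` is hypothetical.

```shell
#!/bin/sh
# Simplified model: what happens after a failure on the current node?
#   $1 = fail count of the resource on this node
#   $2 = migration-threshold (0 is treated as "feature disabled")
# Prints "restart" (recover in place) or "move" (node is no longer eligible).
next_action() {
    failcount=$1
    threshold=$2
    if [ "$threshold" -gt 0 ] && [ "$failcount" -ge "$threshold" ]; then
        echo move
    else
        echo restart
    fi
}
```

With migration-threshold="3", the first two failures lead to a local restart and the third pushes the resource to the other node. The fail count stays in place until it is cleared, which is what the cleanup command above is for.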

STONITH

What triggers STONITH

Sudden node disappearance

A resource that can’t be stopped

What STONITH needs

stonith:external/meatware

Resource monitoring

Not difficult

resource monitoring

primitive p_ip ocf:heartbeat:IPaddr2 \
    params ip=192.168.122.120

resource monitoring

primitive p_ip ocf:heartbeat:IPaddr2 \
    params ip=192.168.122.120 \
    op monitor interval=20s timeout=10s

If things go wrong, Pacemaker will …

… restart the resource on the same host

resource monitoring

primitive p_vm-staging ocf:heartbeat:VirtualDomain \
    params config="/etc/libvirt/qemu/staging.cfg"

resource monitoring

primitive p_vm-staging ocf:heartbeat:VirtualDomain \
    params config="/etc/libvirt/qemu/staging.cfg" \
    op monitor interval="60s" timeout="30s"

DRBD


DRBD monitoring

primitive p_drbd-backup ocf:linbit:drbd \
    params drbd_resource="backup" \
    op monitor interval="30s" role="Slave" \
    op monitor interval="25s" role="Master"

DRBD resource level fencing


The code side of things

How does Pacemaker control resources?

Storage: SAN, DRBD, GlusterFS, Ceph
Cluster Messaging: Corosync
Cluster Resource Management: Pacemaker
Application Interface: MySQL RA


Pacemaker: CRMd (Cluster Resource Manager)
LRMd (Local Resource Manager)
Resource Agent (API): MySQL RA
Application: MySQL

Pacemaker uses Resource Agents

So improve the RA!

Good RAs follow the OCF standard
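To give an idea of what following the OCF standard means for the monitor action, here is a minimal sketch of the skeleton. The agent name `mydaemon` and the `pidfile` parameter are hypothetical. The numeric return codes are the ones OCF defines, which a real agent would normally get by sourcing ocf-shellfuncs rather than defining them itself. The start, stop, meta-data and validate-all actions are left out.

```shell
#!/bin/sh
# OCF return codes, defined inline so the sketch is self-contained.
: "${OCF_SUCCESS:=0}"
: "${OCF_ERR_GENERIC:=1}"
: "${OCF_ERR_UNIMPLEMENTED:=3}"
: "${OCF_NOT_RUNNING:=7}"

# Hypothetical resource parameter: where the daemon writes its pid.
: "${OCF_RESKEY_pidfile:=/var/run/mydaemon.pid}"

# monitor: report "running" only if the pid file names a live process.
# A missing or stale pid file counts as "not running".
mydaemon_monitor() {
    [ -r "$OCF_RESKEY_pidfile" ] || return "$OCF_NOT_RUNNING"
    pid=$(cat "$OCF_RESKEY_pidfile")
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        return "$OCF_SUCCESS"
    fi
    return "$OCF_NOT_RUNNING"
}

# Dispatch on the action name the cluster passes as the first argument.
mydaemon_dispatch() {
    case "$1" in
        monitor) mydaemon_monitor ;;
        *)       return "$OCF_ERR_UNIMPLEMENTED" ;;  # other actions omitted here
    esac
}
```

The distinction between OCF_NOT_RUNNING (cleanly stopped) and OCF_ERR_GENERIC (running but broken) matters: it is what lets the cluster tell a stopped resource from a failed one.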

2 Examples

asterisk

asterisk_monitor() {
    […]
    ocf_run asterisk -rcx 'core show channels count'
    rc=$?
    if [ $rc -ne 0 ]; then
        ocf_log err "Failed to connect to the Asterisk PBX"
        return $OCF_ERR_GENERIC
    fi
    […]
}

Asterisk (2)

asterisk_monitor() {
    […]
    if [ -n "$OCF_RESKEY_monitor_sipuri" ]; then
        ocf_run sipsak -s "$OCF_RESKEY_monitor_sipuri"
        rc=$?
        case "$rc" in
            1|2) return $OCF_ERR_GENERIC;;
            3) return $OCF_NOT_RUNNING;;
        esac
    fi
}

VirtualDomain (libvirt)

for script in ${OCF_RESKEY_monitor_scripts}; do
    script_output="$($script 2>&1)"
    script_rc=$?
    if [ ${script_rc} -ne ${OCF_SUCCESS} ]; then
        # A monitor script returned a non-success exit
        # code. Stop iterating over the list of scripts, log a
        # warning message, and propagate $OCF_ERR_GENERIC.
        ocf_log warn "Monitor command \"${script}\" for domain ${DOMAIN_NAME} returned ${script_rc} w/ output: ${script_output}"
        rc=$OCF_ERR_GENERIC
        break
    else
        ocf_log debug "Monitor command \"${script}\" for domain ${DOMAIN_NAME} completed successfully with output: ${script_output}"
    fi
done




op monitor interval="60" timeout="60"

monitor_scripts="/usr/local/bin/mymon.sh" \
op monitor interval="60" timeout="30"
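Judging by the VirtualDomain loop shown earlier, a script listed in monitor_scripts only has to honour one contract: exit 0 when the guest's service is healthy, anything else when it is not. The sketch below replays that loop in a self-contained form so the contract can be tried out with stand-in scripts. `run_monitor_scripts` and the stubbed `ocf_log` are assumptions for illustration and are not part of the real agent.

```shell
#!/bin/sh
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# Stub for the real ocf_log helper: write "<level>: <message>" to stderr.
ocf_log() {
    echo "$1: $2" >&2
}

# Run each given monitor script in order; stop at the first one that fails
# and report OCF_ERR_GENERIC, otherwise report OCF_SUCCESS.
run_monitor_scripts() {
    rc=$OCF_SUCCESS
    for script in "$@"; do
        script_output="$($script 2>&1)"
        script_rc=$?
        if [ "$script_rc" -ne "$OCF_SUCCESS" ]; then
            ocf_log warn "Monitor command \"$script\" returned $script_rc w/ output: $script_output"
            rc=$OCF_ERR_GENERIC
            break
        fi
    done
    return "$rc"
}
```

For example, `run_monitor_scripts true false` returns 1 because the second stand-in fails, while `run_monitor_scripts true true` returns 0. Since the timeout of the monitor operation covers all scripts together, each script should finish well within it.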

One last thing

Recent GitHub outage

Put a node into standby mode

alice:~$ crm node standby alice

Enable maintenance-mode

alice:~$ crm configure property maintenance-mode=true
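Maintenance mode is easy to forget to switch off again. One way to guard against that is a small wrapper that enables it, runs a command and always disables it afterwards. The function name `with_maintenance_mode` and the `CRM` override (useful for a dry run with CRM=echo) are hypothetical. The two crm invocations mirror the command shown above.

```shell
#!/bin/sh
# Run a command while the cluster is in maintenance-mode, then switch
# maintenance-mode off again regardless of how the command ended.
# Returns the exit status of the wrapped command.
with_maintenance_mode() {
    crm_cmd=${CRM:-crm}
    "$crm_cmd" configure property maintenance-mode=true || return 1
    "$@"
    status=$?
    "$crm_cmd" configure property maintenance-mode=false || return 1
    return "$status"
}
```

A call such as `with_maintenance_mode /usr/local/bin/upgrade-mysql.sh` (a made-up script name) would keep Pacemaker from reacting to the service going down during the upgrade.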

http://www.hastexo.com/category/tags/pacemaker

