Linux Tutorial: Deploying OpenSer Under Linux-HA - Heartbeat v2.0

Heartbeat (or more formally, Linux-HA) provides application monitoring with the ability to restart or migrate a service (like OpenSER) and its dependent resources (like IP addresses) to other machines in the event of a failure. Typically a monitoring process returns the status of a resource; the check can be as simple as a ping or as complex as a full-fledged application-level test. In the event of a failure, a tree of services (typically the IP alias and the service that runs on top of it) is restarted or migrated to a new, more desirable node.

The Linux-HA project started as a simple process monitoring and failover application that, among other things, didn't take service hierarchy into account. Version 2 of Linux-HA was a major rewrite of the application which added hierarchically defined services and used the industry standard OCF definition to describe service monitoring tools and dependency trees.

OCF files for the services are kept in /usr/lib/ocf/resource.d and are grouped by directories named after each provider. The included provider is heartbeat, which supplies (among other things) IPaddr2, which I use for IP address setup, teardown and monitoring. It differs from IPaddr (also in that directory) in that it is iproute2 aware. The other provider I use is anders.com, which contains the OpenSer OCF resource agent. This agent controls and monitors OpenSer at the application level by using sipsak to send test requests to the application layer.

The service definition hierarchy is maintained in the /var/lib/heartbeat/crm/cib.xml file. This is the main file for configuring Linux-HA. It is VERY finicky.

During normal operation, the cib.xml file will be synchronized between all the nodes which means it will get rewritten. It contains the state information for the services being monitored and hashes for each of the nodes in the group. If you need to make a change to the cib.xml file, start by shutting down all of the nodes in the group. Make sure you keep all IDs unique across the file and be aware of the backup files in the same directory. It doesn't hurt to blow everything except for the cib.xml file away on all machines when heartbeat is stopped to make sure all nodes are in sync. Once you have made the changes you wish to make, increment the admin_epoch number in the cib.xml and copy it to each of the participating nodes. Start the preferred node before any others to minimize service migration.
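The admin_epoch bump described above can be scripted. What follows is only a sketch under stated assumptions: bump_admin_epoch is a made-up helper name, and it assumes the opening cib tag sits on a single line and already carries an admin_epoch attribute. Run it only while heartbeat is stopped on all nodes.

```shell
#!/bin/sh
# Sketch: increment admin_epoch on the opening cib tag of a cib.xml file.
# Assumption: the opening tag is on one line and already has admin_epoch="N".
bump_admin_epoch() {
    cib="$1"
    # Pull the current value out of the opening cib tag.
    current=$(sed -n 's/.*<cib[^>]*admin_epoch="\([0-9][0-9]*\)".*/\1/p' "$cib" | head -n 1)
    if [ -z "$current" ]; then
        echo "no admin_epoch attribute found in $cib" >&2
        return 1
    fi
    next=$((current + 1))
    # Rewrite the file through a temporary copy, then move it into place.
    sed "s/\(<cib[^>]*admin_epoch=\"\)$current\"/\1$next\"/" "$cib" > "$cib.tmp" &&
        mv "$cib.tmp" "$cib"
    echo "$next"
}

# Example (as root, heartbeat stopped everywhere):
#   bump_admin_epoch /var/lib/heartbeat/crm/cib.xml
```

After the bump, the file still has to be copied to every participating node by whatever means you normally use.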

The ha.cf file in /etc/ha.d configures some very basic heartbeat options. Most significantly, it dictates whether or not the CRM engine is on. This essentially differentiates between the old heartbeat version 1 and the new heartbeat version 2 with CRM support. (I use v2 with CRM.) The ha.cf file also lists all the nodes that will be participating in the cluster and how inter-cluster communication will work. In this case we will be sending broadcasts from eth0.10, or VLAN 10 on eth0.

Note: It is very important to give each node the name that the uname -n command reports on that machine. You can't just pick whatever name might sound good to you unless you rename the machine itself.

/etc/ha.d/ha.cf

udpport 469
bcast eth0.10
node sip-a sip-b
crm on

When running multiple heartbeat setups on the same broadcast segment you must use a separate port for each setup.
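Because each node name has to match the name the machine reports for itself (uname -n), a quick check along these lines can catch a mismatch before heartbeat is started. This is a sketch, not part of heartbeat: node_in_hacf is a hypothetical helper, and it assumes node names appear on lines that begin with the node keyword, as in the ha.cf above.

```shell
#!/bin/sh
# Sketch: succeed if the given name is listed on a "node" line of ha.cf.
# node_in_hacf is a hypothetical helper name, not a heartbeat tool.
node_in_hacf() {
    hacf="$1"
    name="$2"
    awk -v want="$name" '
        $1 == "node" { for (i = 2; i <= NF; i++) if ($i == want) found = 1 }
        END { exit found ? 0 : 1 }
    ' "$hacf"
}

# Example:
#   node_in_hacf /etc/ha.d/ha.cf "$(uname -n)" || echo "this host is not listed in ha.cf"
```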

Authkeys

As the name implies, the authkeys file lists the private strings the nodes will use as keys to authenticate communication between the nodes. As this is private data, the file should only be readable by root.

chown root:root /etc/ha.d/authkeys
chmod 600 /etc/ha.d/authkeys

The file lists the signing method (md5) and the string to be used as the key. The key authenticates the traffic between nodes; it does not encrypt it.

/etc/ha.d/authkeys

auth 1
1 md5 a0ff2cc2bbdff6c7a55090ea4f55400f
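Rather than reusing the key printed above, you will want your own. One way to generate a file in the same two-line layout is sketched here, assuming /dev/urandom and md5sum are available; write_authkeys is a made-up helper name. Run it on one node and copy the resulting file to the others, since every node needs the same key.

```shell
#!/bin/sh
# Sketch: write an authkeys file with a random md5 key, readable only by its owner.
# write_authkeys is a hypothetical helper; /dev/urandom and md5sum are assumed present.
write_authkeys() {
    target="$1"
    # Hash a block of random bytes to get a 32-character hex string.
    key=$(dd if=/dev/urandom bs=512 count=1 2>/dev/null | md5sum | cut -d' ' -f1)
    (
        # Restrictive umask so a newly created file is never world-readable.
        umask 077
        printf 'auth 1\n1 md5 %s\n' "$key" > "$target"
    )
    chmod 600 "$target"
}

# Example (as root):
#   write_authkeys /etc/ha.d/authkeys
```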

The cib.xml File

This is an example cib.xml file that assumes two IP addresses and runs OpenSER on them. In the event of a migration, the new node will start the IP addresses (sending a gratuitous arp) and then start OpenSER. The order of services (what depends on what) is described in this example.

/var/lib/heartbeat/crm/cib.xml

<cib>
<configuration>
<crm_config>
<cluster_property_set id="cluster-property-set">
<attributes>
<nvpair id="short_resource_names" name="short_resource_names" value="true"/>
<nvpair id="pe-input-series-max" name="pe-input-series-max" value="-1"/>
<nvpair id="default-resource-stickiness" name="default-resource-stickiness" value="10"/>
<nvpair id="default-resource-failure-stickiness" name="default-resource-failure-stickiness" value="-10"/>
<nvpair id="start-failure-is-fatal" name="start-failure-is-fatal" value="false"/>
</attributes>
</cluster_property_set>
<cluster_property_set id="cib-bootstrap-options">
<attributes>
<nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1194982799"/>
</attributes>
</cluster_property_set>
</crm_config>
<nodes />
<resources>
<group id="IPaddr2_OpenSer_group">
<primitive id="IPaddr2-1.2.3.4" class="ocf" type="IPaddr2" provider="heartbeat">
<operations>
<op id="ipaddr2-1.2.3.4-monitor" name="monitor" interval="5s" timeout="3s"/>
</operations>
<instance_attributes id="IPaddr2-1.2.3.4-attributes">
<attributes>
<nvpair id="ipaddr2-1.2.3.4-ip" name="ip" value="1.2.3.4"/>
<nvpair id="ipaddr2-1.2.3.4-broadcast" name="broadcast" value="1.2.3.255"/>
<nvpair id="ipaddr2-1.2.3.4-cidr_netmask" name="cidr_netmask" value="24"/>
</attributes>
</instance_attributes>
</primitive>
<primitive id="IPaddr2-1.2.3.5" class="ocf" type="IPaddr2" provider="heartbeat">
<operations>
<op id="ipaddr2-1.2.3.5-monitor" name="monitor" interval="5s" timeout="3s"/>
</operations>
<instance_attributes id="IPaddr2-1.2.3.5-attributes">
<attributes>
<nvpair id="ipaddr2-1.2.3.5-ip" name="ip" value="1.2.3.5"/>
<nvpair id="ipaddr2-1.2.3.5-broadcast" name="broadcast" value="1.2.3.255"/>
<nvpair id="ipaddr2-1.2.3.5-cidr_netmask" name="cidr_netmask" value="24"/>
</attributes>
</instance_attributes>
</primitive>
<primitive id="OpenSer" class="ocf" type="OpenSer" provider="anders.com">
<operations>
<op id="openser-start" name="start" timeout="5s"/>
<op id="openser-stop" name="stop" timeout="3s"/>
<op id="openser-monitor" name="monitor" interval="10s" timeout="3s">
<instance_attributes id="monitor_10s">
<attributes>
<nvpair id="openser-monitor-ip" name="ip" value="127.0.0.1"/>
</attributes>
</instance_attributes>
</op>
</operations>
</primitive>
</group>
</resources>
<constraints>
<rsc_location id="OpenSer_resource_location" rsc="OpenSer">
<rule id="rule_sip-a" score="100">
<expression id="expression_uname_eq_sip-a" attribute="#uname" operation="eq" value="sip-a"/>
</rule>
<rule id="rule_sip-b" score="10">
<expression id="expression_uname_eq_sip-b" attribute="#uname" operation="eq" value="sip-b"/>
</rule>
</rsc_location>
</constraints>
</configuration>
</cib>

We verify cib files with crm_verify:

crm_verify -x /var/lib/heartbeat/crm/cib.xml

Make sure you set the ownership of that file to cluster:cluster and remove the backup versions on the off chance they conflict with the new cib.xml file.

rm /var/lib/heartbeat/crm/cib.xml.*
chown cluster:cluster -R /var/lib/heartbeat/crm/
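Since every id in cib.xml has to be unique, a rough pre-check for duplicates can save a round trip. The sketch below is not a replacement for crm_verify; duplicate_ids is a hypothetical helper that simply lists any id value appearing more than once, and it assumes a grep that supports -o.

```shell
#!/bin/sh
# Sketch: print every id="..." value that occurs more than once in a file.
# duplicate_ids is a hypothetical helper; it does no XML parsing at all.
duplicate_ids() {
    grep -o '[[:space:]]id="[^"]*"' "$1" |
        sed 's/^[[:space:]]*id="//; s/"$//' |
        sort | uniq -d
}

# Example:
#   duplicate_ids /var/lib/heartbeat/crm/cib.xml
# No output means no duplicated ids were found.
```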

OCF Files

I wrote my own OCF file for monitoring OpenSER, which uses sipsak to do application-level testing over 127.0.0.1. (Make sure OpenSer listens on 127.0.0.1 as well.)

/usr/lib/ocf/resource.d/anders.com/OpenSer

#!/bin/sh

# Initialization:

. /usr/lib/ocf/resource.d/heartbeat/.ocf-shellfuncs

usage() {
cat <<-!
usage: $0 {start|stop|status|monitor|meta-data|validate-all}
!
}

meta_data() {
cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="OpenSer">
<version>1.0</version>

<longdesc lang="en">
Resource Agent for the OpenSer SIP Proxy.
</longdesc>
<shortdesc lang="en">OpenSer resource agent</shortdesc>

<parameters>
<parameter name="ip" unique="0" required="1">
<longdesc lang="en">
IP Address of the OpenSer Instance. This is only used for monitoring.
</longdesc>
<shortdesc lang="en">IP Address</shortdesc>
<content type="string" default="" />
</parameter>

<parameter name="port" unique="0" required="1">
<longdesc lang="en">
Port of the OpenSer Instance. This is only used for monitoring.
</longdesc>
<shortdesc lang="en">Port</shortdesc>
<content type="string" default="5060" />
</parameter>
</parameters>

<actions>
<action name="start" timeout="30" />
<action name="stop" timeout="30" />
<action name="status" depth="0" timeout="30" interval="10" start-delay="30" />
<action name="monitor" depth="0" timeout="30" interval="10" start-delay="30" />
<action name="meta-data" timeout="5" />
<action name="validate-all" timeout="5" />
<action name="notify" timeout="5" />
<action name="promote" timeout="5" />
<action name="demote" timeout="5" />
</actions>
</resource-agent>
END
}

OpenSer_Status() {
# Send a SIP OPTIONS request with sipsak; a non-zero exit means no usable reply came back.
/usr/bin/sipsak -s "sip:test@$OCF_RESKEY_ip" -H 127.0.0.1 >/dev/null 2>&1
rc=$?
if
[ $rc -ne 0 ]
then
return $OCF_NOT_RUNNING
else
return $OCF_SUCCESS
fi
}

OpenSer_Monitor( ) {
OpenSer_Status
}

OpenSer_Start( ) {
if
OpenSer_Status
then
ocf_log info "OpenSer already running."
return $OCF_SUCCESS
else
/etc/init.d/openser start >/dev/null
rc=$?
if
[ $rc -ne 0 ]
then
return $OCF_ERR_PERM
else
return $OCF_SUCCESS
fi
fi
}

OpenSer_Stop( ) {
/etc/init.d/openser stop >/dev/null
return $OCF_SUCCESS
}

OpenSer_Validate_All( ) {
return $OCF_SUCCESS
}

if [ $# -ne 1 ]; then
usage
exit $OCF_ERR_ARGS
fi

case $1 in
meta-data) meta_data
exit $OCF_SUCCESS
;;
start) OpenSer_Start
;;
stop) OpenSer_Stop
;;
monitor) OpenSer_Monitor
;;
status) OpenSer_Status
;;
validate-all) OpenSer_Validate_All
;;
notify) exit $OCF_SUCCESS
;;
promote) exit $OCF_SUCCESS
;;
demote) exit $OCF_SUCCESS
;;
usage) usage
exit $OCF_SUCCESS
;;
*) usage
exit $OCF_ERR_ARGS
;;
esac
exit $?

We use the OCF tester to check the validity of this OCF file. (Make sure you set the IP to the service address on your system. Be aware that this will start the service so it can test application monitoring and shutdown so don't run it on production IPs unless you know what you are doing.)

/usr/lib/ocf/resource.d/ocf-tester -o ip=127.0.0.1 /usr/lib/ocf/resource.d/anders.com/OpenSer
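When ocf-tester reports a failure, it can help to call a single action by hand and look at the exit code. The wrapper below is a sketch with assumptions: run_agent is a made-up name, OCF_ROOT=/usr/lib/ocf is assumed to match the install used here, and only the two standard OCF exit codes relevant to monitoring (0 for success, 7 for not running) are decoded.

```shell
#!/bin/sh
# Sketch: run one action of a /bin/sh resource agent by hand and name its exit code.
# run_agent is a hypothetical helper; OCF_ROOT=/usr/lib/ocf is an assumption.
run_agent() {
    agent="$1"
    action="$2"
    ip="${3:-127.0.0.1}"
    # The agent reads its "ip" parameter from the OCF_RESKEY_ip environment variable.
    OCF_ROOT=/usr/lib/ocf OCF_RESKEY_ip="$ip" /bin/sh "$agent" "$action"
    rc=$?
    case $rc in
        0) echo "$action: OCF_SUCCESS" ;;
        7) echo "$action: OCF_NOT_RUNNING" ;;
        *) echo "$action: rc=$rc" ;;
    esac
    return $rc
}

# Example:
#   run_agent /usr/lib/ocf/resource.d/anders.com/OpenSer monitor
```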

Hacks to the Standard Gentoo Heartbeat Build

I don't emerge heartbeat but rather build it from source. (heartbeat-2.1.2 as of this writing) However, older installs may have left an incompatible version of heartbeat installed whose elements can conflict. Typically this will show up in the logs as a crash of pengine or some other process heartbeat spawns. To avoid these errors, rm -fr /usr/lib/heartbeat and re-install.

To configure from source, build and install:

./ConfigureMe configure
make
make install

This will configure and build a setup with config files in a Gentoo-ish layout. You will find most important configuration in:

/etc/ha.d
/usr/lib/ocf/resource.d
/usr/local/var/run/heartbeat/crm

/etc/init.d/openser uses killproc, but that could be a little too random if you run multiple instances of openser on the same machine. However, it does write a PID file when it starts openser, so we change the killproc line to:

kill `cat $PIDFILE` &>/dev/null

Using killproc or killall might kill other instances of openser on the same machine so killing the master PID is a much better solution.
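A slightly more defensive version of that kill line is sketched below. It is an illustration, not what the Gentoo init script ships with: stop_from_pidfile is a made-up function name, and the PID file path in the example is a placeholder for whatever your init script sets in $PIDFILE.

```shell
#!/bin/sh
# Sketch: send TERM only to the process recorded in a PID file.
# stop_from_pidfile is a hypothetical helper name.
stop_from_pidfile() {
    pidfile="$1"
    if [ ! -r "$pidfile" ]; then
        echo "no PID file at $pidfile, nothing to stop" >&2
        return 0
    fi
    pid=$(cat "$pidfile")
    # Refuse to call kill with anything that is not a plain number.
    case "$pid" in
        ''|*[!0-9]*)
            echo "PID file $pidfile does not hold a number" >&2
            return 1
            ;;
    esac
    # kill fails if the process is already gone; treat that as stopped.
    kill "$pid" 2>/dev/null
    return 0
}

# Example (placeholder path):
#   stop_from_pidfile "$PIDFILE"
```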

If monit is on the box, it is usually started and stopped from within the /etc/init.d/openser file. In that case the conventional /etc/init.d/openser start command starts monit, which in turn executes /etc/init.d/openser openserstart to get openser running. When I use heartbeat, there is no reason to use monit, but I still have to start openser with /etc/init.d/openser openserstart. (You might need to change this in the OCF file above.)

Heartbeat with OpenSer Checklist

When activating a heartbeat controlled OpenSER setup make sure to:

* Blow away any old heartbeat installs

rm -fr /var/lib/heartbeat
rm -fr /usr/lib/heartbeat

* Compile from source and install the latest-greatest tested release. (heartbeat-2.1.2 as of this writing)

./ConfigureMe configure
make
make install

* Edit /var/lib/heartbeat/crm/cib.xml to taste.
* Kill all old cib.xml.* files:

rm /var/lib/heartbeat/crm/cib.xml.*

* Set the file ownership on the crm directory and files:

chown cluster:cluster -R /var/lib/heartbeat/crm/

* Edit /etc/init.d/openser
  o Make sure the correct version of openser gets started.
  o The OCF file runs /etc/init.d/openser start, so make sure start works, or change the OCF file to run openserstart instead if monit changed /etc/init.d/openser.
  o Make sure killproc isn't used. Instead, kill the pid from the pidfile as mentioned above.
* Edit /etc/ha.d/ha.cf to make it look something like this:

udpport 469
bcast eth0.10
node sip-a sip-b
crm on

* Make sure you have sipsak on your box in /usr/bin/sipsak.

which sipsak

* Make sure the same IPs are specified in the openser.cfg and the cib.xml files.
* Make sure the IP used for monitoring is 127.0.0.1 in the cib.xml.
* Make sure that openser is listening on 127.0.0.1 as well as its production IPs.
* If a nameserver isn't reachable, OpenSer will hang trying to reverse-resolve the production IPs. Add entries for them in /etc/hosts that reflect the real names so there are as few external dependencies as possible.
* Make sure OpenSER is configured to respond to OPTIONS messages on 127.0.0.1 so the OCF Tester can test the application-level health of OpenSer.
* Test sipsak to make sure it succeeds / fails when the service is on / off.

/usr/bin/sipsak -s sip:test@127.0.0.1 -H 127.0.0.1

* Make sure OpenSER has libpq.so.5 for the PostgreSQL module if you are using PostgreSQL. If not, install PostgreSQL 8.2.5 or later. (as of this writing)
* Add the production IPs to the box (or comment them out of openser.cfg) and use the OCF tester to make sure it can start / monitor / stop openser.

ip address add 1.2.3.4/24 dev eth0.10
ip address add 1.2.3.5/24 dev eth0.10

/usr/lib/ocf/resource.d/ocf-tester -o ip=127.0.0.1 /usr/lib/ocf/resource.d/anders.com/OpenSer

Make sure it says "/usr/lib/ocf/resource.d/anders.com/OpenSer passed all tests"

If heartbeat yammers like this:

Nov 28 20:52:28 sip-a heartbeat: [22070]: WARN: nodename sip-a uuid changed to sip-b
Nov 28 20:52:28 sip-a heartbeat: [22070]: debug: displaying uuid table
Nov 28 20:52:28 sip-a heartbeat: [22070]: debug: uuid=9052abe5-87ee-4400-a008-c5f13205e94b, name=sip-a
Nov 28 20:52:28 sip-a heartbeat: [22070]: ERROR: should_drop_message: attempted replay attack [sip-b]? [gen = 10, curgen = 21]

then kill this file:

rm /var/lib/heartbeat/hb_uuid

Controlling Heartbeat

To get an overview of what's going on, run:

crm_mon

To list the resources under control:

crm_resource -L

To push a resource off of this box:

crm_resource -M -r OpenSer

This creates a constraint scored at -INFINITY saying that the resource should not run on this host.

To remove the -INFINITY constraint placed by the above command:

crm_resource -U -r OpenSer

When a resource is moved off of a node because it can't be started (for example when the openser.cfg file is broken) the node is marked as bad and the resource is migrated to another node. After fixing the resource, you will need to clear the resource before it will migrate back to the primary. That is done like this:

crm_resource -C -r OpenSer

However, when a resource fails for whatever reason, its failure count is incremented. To actually "fail back" to the primary node you must also make sure the failure count is below the threshold for that resource. (A good practice is to set it back to 0.)

To see the failure count:

crm_failcount -G -U sip-a -r OpenSer

To reset the failure count:

crm_failcount -v 0 -U sip-a -r OpenSer
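The clean-up and failure-count commands above can be bundled into one helper that covers every node. This is a sketch with made-up names (fail_back, run_or_print); it uses only the crm_failcount and crm_resource invocations already shown, and by default it prints the commands instead of running them so you can review them first.

```shell
#!/bin/sh
# Sketch: reset the failure count of a resource on each node, then clear the resource.
# fail_back and run_or_print are hypothetical helper names.
# Commands are printed by default; set RUN=1 in the environment to execute them.
run_or_print() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

fail_back() {
    resource="$1"
    shift
    for node in "$@"; do
        run_or_print crm_failcount -v 0 -U "$node" -r "$resource"
    done
    run_or_print crm_resource -C -r "$resource"
}

# Example:
#   fail_back OpenSer sip-a sip-b          # print the commands
#   RUN=1 fail_back OpenSer sip-a sip-b    # actually run them
```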

Configuring when to move a service from node to node is done through scores assigned to individual nodes and the stickiness / failure-stickiness of resources.

The calculation is:

(sip-a score - sip-b score + stickiness) / abs(failure stickiness)

In our case, the settings are:

sip-a = 100
sip-b = 10
default stickiness = 10
stickiness = 30 (10 for each resource: ip, ip, OpenSer)
failure stickiness = -10

So:

(sip-a - sip-b + stickiness) / abs(failure stickiness)
= (100 - 10 + (10 + 10 + 10)) / 10
= 130 / 10
= 13

Therefore, in this case OpenSer can fail 13 times on sip-a before being moved to sip-b. Of course, if a service fails to start, it is immediately moved and the node is marked bad. This is desirable for a service that we don't want to see down, because the service in effect reverts to the last known-good configuration running on the backup node. This allows us to fix our primary node while the service runs on the backup.

Manually Failing Back

If OpenSER fails to start on a node (for example, when you have a broken config file), the node is marked as bad and a restart won't be attempted. To force a resource to fail back to the primary, reset the failure counts to 0 on the primary and backup:

crm_failcount -v 0 -U sip-a -r OpenSer
crm_failcount -v 0 -U sip-b -r OpenSer

and clear the OpenSer resource so it forgets where it wasn't able to start:

crm_resource -C -r OpenSer

This should work in all cases. If the resource still migrates to the backup node, there is a good chance OpenSER is still broken on the primary node.

lrmd CPU usage

A patch for lrmd that reduces CPU usage is here: http://hg.linux-ha.org/dev/rev/0ded50597e97


Comments (1)

Anders from RTP

I've been asked for the OpenSER start / stop script I use. This comes with Gentoo. (I think)

/etc/init.d/openser


About Me:


Name: Anders Brownworth
Location: Research Triangle Park, North Carolina, USA
Work: Head of Research & Development, Bandwidth.com
Play: Technology, World Traveler and Licensed Helicopter Pilot
