You can write a resource agent easily by reading the OCF Resource Agent Developer's Guide, but the guide is long, so this article is for people who just want to build "something that works for the time being".
OCF stands for Open Cluster Framework, which defines the interface for clustering applications. A cluster manager such as Pacemaker manages applications and virtual IP addresses as "resources", and tells those resources to start, stop, migrate, promote to master, demote to slave, and so on. A resource agent is an OCF-compliant program that sits between a resource and cluster management software such as Pacemaker; by writing one, you can cluster your own app with Pacemaker. The goal this time is to write such a resource agent myself.
Specifically, OCF-compliant cluster management software controls a resource by running an executable. This executable is called a resource agent, and it is what actually operates the resource.
The cluster manager passes the action to perform as the agent's first command-line argument, and the OCF shell function library sourced by the agent exposes it as the variable `$__OCF_ACTION`. The resource agent looks at `$__OCF_ACTION` and performs that action.
Actions include start, stop, migration, promotion to master, and demotion to slave. Some actions must be implemented and others are optional.
If an action needs parameters, they are passed in environment variables prefixed with `OCF_RESKEY_`.
As long as the executable meets the OCF API requirements, it can be written in any language, but resource agents are usually implemented as shell scripts.
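The calling convention described above can be sketched from the caller's side. This is only an illustration under assumptions: the toy agent is written to a temporary file, does not source the OCF shell functions, and copies its first argument into `__OCF_ACTION` by hand (my understanding is that the shell function library normally does this). `invoke_agent` and the `greet` parameter are hypothetical names.

```shell
#!/bin/sh
# Toy agent written to a temporary file. It is hypothetical and only reports
# what it was asked to do; it is not a real resource agent.
agent=$(mktemp)
cat > "$agent" << 'EOF'
#!/bin/sh
__OCF_ACTION=$1   # done by hand here (assumption: the sourced library sets it)
echo "action=${__OCF_ACTION} greet=${OCF_RESKEY_greet}"
EOF

# Stand-in for the cluster manager: the action goes in as an argument and the
# parameter goes in as an OCF_RESKEY_<name> environment variable.
invoke_agent() {
    # $1 = action name, $2 = value of the hypothetical "greet" parameter
    OCF_RESKEY_greet="$2" sh "$agent" "$1"
}

invoke_agent start "Hi!"
invoke_agent monitor "Hi!"
```

The point of the sketch is that the agent itself holds no state about what to do: everything it needs arrives with each invocation.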
sample-resource
#!/bin/sh
# Initialize
: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/resource.d/heartbeat}
. ${OCF_FUNCTIONS_DIR}/.ocf-shellfuncs

# Marker file that stands in for a running daemon
RUNNING_FILE=/tmp/.running

sample_meta_data() {
    cat << EOF
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="sample-resource" version="0.1">
  <version>0.1</version>
  <longdesc lang="en">sample resource</longdesc>
  <shortdesc lang="en">sample resource</shortdesc>
  <parameters>
  </parameters>
  <actions>
    <action name="meta-data" timeout="5" />
    <action name="start" timeout="5" />
    <action name="stop" timeout="5" />
    <action name="monitor" timeout="5" />
    <action name="validate-all" timeout="5" />
  </actions>
</resource-agent>
EOF
    return $OCF_SUCCESS
}

sample_validate() {
    return $OCF_SUCCESS
}

sample_start() {
    touch ${RUNNING_FILE}
    return $OCF_SUCCESS
}

sample_stop() {
    rm -f ${RUNNING_FILE}
    return $OCF_SUCCESS
}

sample_monitor() {
    if [ -f ${RUNNING_FILE} ]; then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}

sample_usage() {
    echo "Test Resource."
    return $OCF_SUCCESS
}

# Translate each action into the appropriate function call
case $__OCF_ACTION in
    meta-data)    sample_meta_data
                  exit $OCF_SUCCESS
                  ;;
    start)        sample_start;;
    stop)         sample_stop;;
    monitor)      sample_monitor;;
    validate-all) sample_validate;;
    *)            sample_usage
                  exit $OCF_ERR_UNIMPLEMENTED
                  ;;
esac
The initialization is boilerplate that is also described in the OCF Resource Agent Developer's Guide.
#Initialize
: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/resource.d/heartbeat}
. ${OCF_FUNCTIONS_DIR}/.ocf-shellfuncs
The actions described below must be implemented. Everything else is optional, so for now let's define only these.
Each action must return an appropriate return value. The return values are [defined](https://linux-ha.osdn.jp/wp/archives/4328#3) by OCF.
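For reference, the return codes can be pictured as plain numbers. The sketch below hard-codes the values as I understand them so that it runs standalone; a real agent should get them by sourcing the OCF shell functions rather than redefining them. `ocf_rc_name` is a hypothetical helper, not part of any library.

```shell
#!/bin/sh
# Values hard-coded for illustration only (assumption: they match what the
# OCF shell function library defines).
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_ERR_ARGS=2
OCF_ERR_UNIMPLEMENTED=3
OCF_ERR_PERM=4
OCF_ERR_INSTALLED=5
OCF_ERR_CONFIGURED=6
OCF_NOT_RUNNING=7

# ocf_rc_name (hypothetical) maps a numeric code back to its symbolic name,
# which can be handy when reading logs.
ocf_rc_name() {
    case "$1" in
        "$OCF_SUCCESS")           echo OCF_SUCCESS ;;
        "$OCF_ERR_GENERIC")       echo OCF_ERR_GENERIC ;;
        "$OCF_ERR_ARGS")          echo OCF_ERR_ARGS ;;
        "$OCF_ERR_UNIMPLEMENTED") echo OCF_ERR_UNIMPLEMENTED ;;
        "$OCF_ERR_PERM")          echo OCF_ERR_PERM ;;
        "$OCF_ERR_INSTALLED")     echo OCF_ERR_INSTALLED ;;
        "$OCF_ERR_CONFIGURED")    echo OCF_ERR_CONFIGURED ;;
        "$OCF_NOT_RUNNING")       echo OCF_NOT_RUNNING ;;
        *)                        echo UNKNOWN ;;
    esac
}

ocf_rc_name 0
ocf_rc_name 7
```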
Before getting into the concrete definition of each required action, here are the variables that are used.
The action requested of the resource agent is stored in `$__OCF_ACTION` when the agent is called. The resource agent reads this variable first and dispatches to the matching action.
Implementation example (see the OCF Resource Agent Developer's Guide):
case $__OCF_ACTION in
    meta-data)    sample_meta_data
                  exit $OCF_SUCCESS
                  ;;
    start)        sample_start;;
    stop)         sample_stop;;
    monitor)      sample_monitor;;
    validate-all) sample_validate;;
    *)            sample_usage
                  exit $OCF_ERR_UNIMPLEMENTED
                  ;;
esac
Variables prefixed with `OCF_RESKEY_` contain the parameters that are set when the resource is created. They are not used in this example, but they are used later when clustering the TCP server daemon.
meta-data
Metadata provides basic information about the resource agent, such as its name, the actions it provides, and the parameters it accepts. It is written in XML, and the resource agent prints it to standard output when requested.
start
Implements the resource startup process. In practice, this is where the daemon startup script goes.
This time no actual daemon is started; the action just creates a file.
It returns `$OCF_SUCCESS` if the startup succeeds.
stop
Implements resource shutdown. In practice, this is where the daemon is stopped. This time it deletes the file created by start.
If the stop completes without problems, it returns `$OCF_SUCCESS`. (Note that this is not `$OCF_NOT_RUNNING`.)
Note that the stop action means a "forced stop" of the resource: it must stop the resource even if that cannot be done safely. A failed stop can cause fatal problems, so if the stop action fails, the cluster manager may fence the node (isolate it, for example by a forced shutdown). The stop action should use every possible means to stop the resource and return an error code only if the stop still fails.
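"Every possible means" usually translates into escalation: ask politely first, then force. The following is a minimal sketch of that idea for a process identified by PID, not the code used in this article. `stop_pid` is a hypothetical helper, and the return code values are hard-coded so the sketch runs without the OCF shell functions.

```shell
#!/bin/sh
# Hard-coded for a standalone sketch; a real agent sources the OCF library.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# stop_pid (hypothetical): $1 = PID, $2 = seconds to wait after SIGTERM.
stop_pid() {
    pid=$1
    wait_s=$2
    # Already gone: stopping a stopped resource still counts as success.
    kill -0 "$pid" 2>/dev/null || return $OCF_SUCCESS
    # First try a graceful stop.
    kill -TERM "$pid" 2>/dev/null
    i=0
    while [ "$i" -lt "$wait_s" ]; do
        kill -0 "$pid" 2>/dev/null || return $OCF_SUCCESS
        sleep 1
        i=$((i + 1))
    done
    # The graceful stop did not work in time; fall back to SIGKILL.
    kill -KILL "$pid" 2>/dev/null
    sleep 1
    # Report an error only if the process survived even that.
    kill -0 "$pid" 2>/dev/null && return $OCF_ERR_GENERIC
    return $OCF_SUCCESS
}

# Usage: start a throwaway background process and stop it.
sleep 300 &
stop_pid "$!" 5 && echo "stopped"
```

The wait time before escalating is a judgment call; it should stay comfortably below the stop timeout configured for the resource.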
monitor
Implements the check of the resource's status.
It returns `$OCF_SUCCESS` if the resource is running and `$OCF_NOT_RUNNING` if it is not.
If an error occurs, it returns the appropriate error constant starting with `$OCF_ERR_`, depending on the error.
validate-all
Validates the resource configuration: checks that the parameters are set correctly and that the permissions of the files used by the resource are appropriate. The return value must be one of the following:
Return value | meaning |
---|---|
$OCF_SUCCESS | no problem |
$OCF_ERR_CONFIGURED | There is a problem with the settings |
$OCF_ERR_INSTALLED | Required components do not exist (for example, the daemon to be started is not installed) |
$OCF_ERR_PERM | There is a problem with the access permissions of a file required to manage the resource |
(This time, to keep things simple, it always returns `$OCF_SUCCESS`.)
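To make the table concrete, here is a hedged sketch of what a non-trivial validate-all might look like. It is not the implementation used in this article. `validate_sketch` and its arguments are hypothetical, the `greet` parameter is borrowed from the TCP server example later on, and the return code values are hard-coded so the sketch runs standalone.

```shell
#!/bin/sh
# Hard-coded for a standalone sketch; a real agent sources the OCF library.
OCF_SUCCESS=0
OCF_ERR_PERM=4
OCF_ERR_INSTALLED=5
OCF_ERR_CONFIGURED=6

# validate_sketch (hypothetical): $1 = program the resource depends on,
# $2 = directory the resource must be able to write to.
validate_sketch() {
    # A parameter that must not be empty
    [ -n "$OCF_RESKEY_greet" ] || return $OCF_ERR_CONFIGURED
    # The program to be started must be installed
    command -v "$1" >/dev/null 2>&1 || return $OCF_ERR_INSTALLED
    # The working directory must be writable
    [ -w "$2" ] || return $OCF_ERR_PERM
    return $OCF_SUCCESS
}

# Usage: "sh" and a fresh temporary directory stand in for real dependencies.
OCF_RESKEY_greet="Hi!"
workdir=$(mktemp -d)
validate_sketch sh "$workdir" && echo "configuration looks valid"
```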
You can test with ocf-tester.
# ocf-tester -n [resource name] [path to resource agent]
salacia@ha1:~/ocf-scr$ sudo ocf-tester -n sample-resource ./sample-resource
Beginning tests for ./sample-resource...
* Your agent does not support the notify action (optional)
* Your agent does not support the demote action (optional)
* Your agent does not support the promote action (optional)
* Your agent does not support master/slave (optional)
* Your agent does not support the reload action (optional)
./sample-resource passed all tests
Pacemaker looks for resource agents under `/usr/lib/ocf/resource.d/`.
Create a directory named after the provider there, and place the resource agent in it.
My handle is kamaboko, so I used kamaboko as the provider name and placed the sample-resource above in that directory. (The resource agent must be deployed on all nodes of the cluster.)
salacia@ha1:~/ocf-scr$ ls -al /usr/lib/ocf/resource.d/kamaboko/
total 16
drwxrwxr-x 2 root root 4096 Aug 11 14:07 .
drwxr-xr-x 6 root root 4096 Jun 21 03:36 ..
-rwxr-xr-x 1 root root 1547 Aug 11 14:07 sample-resource
-rwxrwxr-x 1 root root 2103 Jun 21 03:36 sample-tcp-server
Now that the resource agent is available to Pacemaker, let's create a resource with the pcs command.
salacia@ha1:~/ocf-scr$ sudo pcs resource create SAMPLE ocf:kamaboko:sample-resource
salacia@ha1:~$ sudo pcs status
Cluster name: c1
Stack: corosync
Current DC: ha2 (version 1.1.18-2b07d5c5a9) - partition with quorum
Last updated: Tue Aug 11 14:15:47 2020
Last change: Tue Aug 11 14:15:45 2020 by root via cibadmin on ha1
2 nodes configured
1 resource configured
Online: [ ha1 ha2 ]
Full list of resources:
SAMPLE (ocf::kamaboko:sample-resource): Started ha1
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
I was able to run my own resource agent in Pacemaker.
Here is the sample code: OCF Resource Agent Sample
For now only a .deb package is provided, so it should work on Debian, Ubuntu, and similar distributions. (The development environment is Ubuntu 18.04.) It assumes that Pacemaker is already installed.
I needed some daemon to cluster, so I wrote an app that simply returns a greeting when you connect over TCP. (Please don't judge the code itself; it's just a sample.) When you start the app and connect with telnet or similar, it returns the greeting text set in an environment variable together with the node's own hostname.
daemon/tcp-server.py
#!/usr/bin/python3
import os
import socket
import threading
import time
import signal
import sys

PORT = 5678
PID_FILE_DIR = "/var/run/sample-tcp-server"
PID_FILE_NAME = "tcp-server.pid"
PID_FILE = "%s/%s" % (PID_FILE_DIR, PID_FILE_NAME)
EXIT = False
GREET = os.environ.get("GREET", "Hello!")


def signal_handler(signum, stack):
    # Without "global", the assignment would only create a local variable.
    global EXIT
    EXIT = True


def server():
    os.makedirs(PID_FILE_DIR, exist_ok=True)
    if os.path.isfile(PID_FILE):
        raise Exception("Already running")
    with open(PID_FILE, "w") as f:
        f.write(str(os.getpid()))
    print("Create Socket")
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(('', PORT))
    s.listen(5)
    # Wake up periodically so the EXIT flag is checked even with no clients.
    s.settimeout(1.0)
    try:
        while True:
            if EXIT:
                raise Exception("Stop daemon due to receive signal")
            try:
                (con, addr) = s.accept()
            except socket.timeout:
                continue
            t = threading.Thread(
                target=handler,
                args=(con, addr),
                daemon=False
            )
            t.start()
    except Exception as e:
        sys.stderr.write("%s\n" % e)
    finally:
        print("Close Socket")
        s.close()
        os.remove(PID_FILE)
    return


def handler(con, addr):
    con.send(("%s This is %s!\n" % (GREET, socket.gethostname())).encode())
    con.close()


if __name__ == '__main__':
    # Register the signal handler, not the connection handler.
    signal.signal(signal.SIGINT, signal_handler)
    signal.signal(signal.SIGTERM, signal_handler)
    server()
It uses only standard modules, so there is no need to install any libraries. It writes a PID file so that you can confirm the process has started. The PID file is deleted when the app exits, so you can tell whether the app is running by whether the PID file exists. However, if the process is killed with SIGKILL or similar, the file is not deleted, so I don't think this is a very good approach. (This is just a test, so I did it this way for simplicity.)
Start-up
GREET=Hi! python3 tcp-server.py
Try connecting with telnet
salacia@ha1:~$ telnet localhost 5678
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Hi!, This is ha1!
Connection closed by foreign host.
The sample code includes a service file so that this app can be started and stopped with systemd. Now let's run this app in a high-availability cluster to make it redundant.
Clone the repository and build the package with make; after that, all you have to do is install it with dpkg. (It must be installed on all nodes of the cluster.)
git clone https://github.com/kamaboko123/OCF_resource_agent_sample.git
cd OCF_resource_agent_sample
make
sudo dpkg -i dist/sampletcpserver_1.0_amd64.deb
For now, configure a VIP (VRRP) between the two nodes and have it fail over in the event of a failure.
#Register VIP as a service
sudo pcs resource create VIP ocf:heartbeat:IPaddr2 ip=172.16.0.50 cidr_netmask=24 op monitor interval=10s on-fail="standby"
#Register the sample application as a service
sudo pcs resource create TCP-SERVER ocf:kamaboko:sample-tcp-server greet=Hi!
#Add a constraint so that the VIP and the sample app are active on the same node
sudo pcs constraint colocation add TCP-SERVER with VIP INFINITY
Connect to the virtual IP address by telnet from an external node.
salacia@Vega:~$ telnet 172.16.0.50 5678
Trying 172.16.0.50...
Connected to 172.16.0.50.
Escape character is '^]'.
Hi!, This is ha1!
Connection closed by foreign host.
Shut down the connected node to fail over and verify that the service is still available.
#Shut down the node (ha1) on which the VIP and TCP-SERVER resources are currently running
salacia@ha1:~$ sudo pcs status
[sudo] password for salacia:
Cluster name: c1
Stack: corosync
Current DC: ha2 (version 1.1.18-2b07d5c5a9) - partition with quorum
Last updated: Tue Aug 11 14:31:30 2020
Last change: Tue Aug 11 14:17:26 2020 by root via cibadmin on ha1
2 nodes configured
2 resources configured
Online: [ ha1 ha2 ]
Full list of resources:
VIP (ocf::heartbeat:IPaddr2): Started ha1
TCP-SERVER (ocf::kamaboko:sample-tcp-server): Started ha1
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
salacia@ha1:~$ sudo shutdown -h now
Connection to 172.16.0.51 closed by remote host.
Connection to 172.16.0.51 closed.
#From the external node, check that the service is still available via the VIP
salacia@Vega:~$ telnet 172.16.0.50 5678
Trying 172.16.0.50...
Connected to 172.16.0.50.
Escape character is '^]'.
Hi!, This is ha2!
Connection closed by foreign host.
#Check the status on the failover destination node (ha2)
salacia@ha2:~$ sudo pcs status
[sudo] password for salacia:
Cluster name: c1
Stack: corosync
Current DC: ha2 (version 1.1.18-2b07d5c5a9) - partition with quorum
Last updated: Tue Aug 11 14:35:51 2020
Last change: Tue Aug 11 14:17:26 2020 by root via cibadmin on ha1
2 nodes configured
2 resources configured
Online: [ ha2 ]
OFFLINE: [ ha1 ]
Full list of resources:
VIP (ocf::heartbeat:IPaddr2): Started ha2
TCP-SERVER (ocf::kamaboko:sample-tcp-server): Started ha2
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
By the way, when testing with ocf-tester, it works if you specify the full path to the resource agent.
sudo ocf-tester -n sample-tcp-server /usr/lib/ocf/resource.d/kamaboko/sample-tcp-server
As explained earlier, the resource agent is written as a shell script.
/usr/lib/ocf/resource.d/kamaboko/sample-tcp-server
#!/bin/sh
# Initialize
: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/resource.d/heartbeat}
. ${OCF_FUNCTIONS_DIR}/.ocf-shellfuncs

# default value
OCF_RESKEY_greet_default="Hello!"
: ${OCF_RESKEY_greet=${OCF_RESKEY_greet_default}}

# PID file written by the daemon (checked by monitor)
DAEMON_PID_FILE=/var/run/sample-tcp-server/tcp-server.pid

sample_meta_data() {
    cat << EOF
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="sample-tcp-server" version="0.1">
  <version>0.1</version>
  <longdesc lang="en">sample tcp server</longdesc>
  <shortdesc lang="en">sample tcp server</shortdesc>
  <parameters>
    <parameter name="greet" unique="0" required="0">
      <longdesc lang="en">greet message</longdesc>
      <shortdesc lang="en">greet message</shortdesc>
      <content type="string"/>
    </parameter>
  </parameters>
  <actions>
    <action name="meta-data" timeout="5" />
    <action name="start" timeout="5" />
    <action name="stop" timeout="5" />
    <action name="monitor" timeout="5" />
    <action name="validate-all" timeout="5" />
  </actions>
</resource-agent>
EOF
    return $OCF_SUCCESS
}

sample_validate() {
    return $OCF_SUCCESS
}

sample_start() {
    mkdir -p /var/run/sample-tcp-server
    # Hand the parameter to the service through its environment file
    echo "GREET=${OCF_RESKEY_greet}" > /var/run/sample-tcp-server/env
    systemctl start sample-tcp-server
    sleep 1
    return $OCF_SUCCESS
}

sample_stop() {
    systemctl stop sample-tcp-server
    return $OCF_SUCCESS
}

sample_monitor() {
    if [ -f ${DAEMON_PID_FILE} ]; then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}

sample_usage() {
    echo "Test Resource."
    return $OCF_SUCCESS
}

# Translate each action into the appropriate function call
case $__OCF_ACTION in
    meta-data)    sample_meta_data
                  exit $OCF_SUCCESS
                  ;;
    start)        sample_start;;
    stop)         sample_stop;;
    monitor)      sample_monitor;;
    validate-all) sample_validate;;
    *)            sample_usage
                  exit $OCF_ERR_UNIMPLEMENTED
                  ;;
esac
In meta-data, parameters are defined in addition to the required items.
A parameter is a value set when the resource is created, and the resource agent can read it from the variable `OCF_RESKEY_<parameter name>`. This time, the greeting text is defined as a parameter named `greet`. For a required parameter the `required` attribute is set to 1, but here it is 0, so the parameter is optional. For that reason the agent also defines a default value to use when the parameter is not specified. (If it is not specified, the default value `Hello!` ends up in `$OCF_RESKEY_greet`.)
#default value
OCF_RESKEY_greet_default="Hello!"
: ${OCF_RESKEY_greet=${OCF_RESKEY_greet_default}}
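The `: ${VAR=default}` idiom is easy to misread, so here is a small standalone sketch of how it behaves. `resolve_greet` is a hypothetical helper used only for this illustration. The form without a colon before `=` assigns the default only when the variable is unset, so an explicitly empty value is kept as is.

```shell
#!/bin/sh
# resolve_greet (hypothetical) applies the same defaulting idiom as the agent
# and prints the resulting value.
resolve_greet() {
    OCF_RESKEY_greet_default="Hello!"
    : ${OCF_RESKEY_greet=${OCF_RESKEY_greet_default}}
    echo "$OCF_RESKEY_greet"
}

# Each case runs in a subshell so that assignments do not leak between cases.
( unset OCF_RESKEY_greet; resolve_greet )   # parameter not given: default is used
( OCF_RESKEY_greet="Hi!"; resolve_greet )   # parameter given: value is kept
( OCF_RESKEY_greet="";    resolve_greet )   # given but empty: stays empty
```

If an empty value should also fall back to the default, `${VAR:=default}` (with the colon) would do that instead.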
sample_meta_data() {
    cat << EOF
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="sample-tcp-server" version="0.1">
  <version>0.1</version>
  <longdesc lang="en">sample tcp server</longdesc>
  <shortdesc lang="en">sample tcp server</shortdesc>
  <parameters>
    <parameter name="greet" unique="0" required="0">
      <longdesc lang="en">greet message</longdesc>
      <shortdesc lang="en">greet message</shortdesc>
      <content type="string"/>
    </parameter>
  </parameters>
  <actions>
    <action name="meta-data" timeout="5" />
    <action name="start" timeout="5" />
    <action name="stop" timeout="5" />
    <action name="monitor" timeout="5" />
    <action name="validate-all" timeout="5" />
  </actions>
</resource-agent>
EOF
    return $OCF_SUCCESS
}
start
This just starts the service with systemd.
The service file is explained later.
The service needs to receive the OCF parameter as an environment variable when it starts, so the value of `${OCF_RESKEY_greet}` is written to a file.
sample_start() {
    mkdir -p /var/run/sample-tcp-server
    echo "GREET=${OCF_RESKEY_greet}" > /var/run/sample-tcp-server/env
    systemctl start sample-tcp-server
    sleep 1
    return $OCF_SUCCESS
}
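The fixed `sleep 1` assumes the daemon is up within a second. One alternative, shown here only as a sketch, is to poll a status check until it succeeds and give up after a bounded number of tries. `wait_until_running` is a hypothetical helper, and the return code values are hard-coded so the sketch runs standalone; in the agent above, the probe could be `sample_monitor`.

```shell
#!/bin/sh
# Hard-coded for a standalone sketch; a real agent sources the OCF library.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# wait_until_running (hypothetical): $1 = command that returns 0 once the
# service is up, $2 = maximum number of attempts (one second apart).
wait_until_running() {
    tries=0
    until "$1"; do
        tries=$((tries + 1))
        [ "$tries" -ge "$2" ] && return $OCF_ERR_GENERIC
        sleep 1
    done
    return $OCF_SUCCESS
}

# Usage: "true" stands in for a probe that succeeds immediately.
wait_until_running true 3 && echo "service is up"
```

The cluster manager also enforces its own timeout on the start action, so the retry limit here is a second line of defense rather than the only one.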
stop
Nothing special here; it just stops the service.
sample_stop() {
    systemctl stop sample-tcp-server
    return $OCF_SUCCESS
}
monitor
This time it just checks whether the PID file exists.
sample_monitor() {
    if [ -f ${DAEMON_PID_FILE} ]; then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}
I did it this way for simplicity, but I don't think it's really a good implementation. The daemon deletes the PID file when it exits, but if it is killed with SIGKILL, the PID file is not deleted. The stop action is supposed to stop the resource by any means, so depending on how the service file is written, SIGKILL may end up being sent. In that case the PID file remains even though the process has stopped, and the actual state can differ from the state reported by the monitor action. (Since I'm using systemd for service management, I should have checked the state via systemd.)
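One way to reduce the stale PID file problem, sketched here under assumptions rather than taken from the sample repository, is to check that the PID recorded in the file still refers to a live process. `pidfile_status` is a hypothetical helper, and the return code values are hard-coded so the sketch runs standalone. Another option would be to ask systemd directly, for example with `systemctl is-active --quiet sample-tcp-server`.

```shell
#!/bin/sh
# Hard-coded for a standalone sketch; a real agent sources the OCF library.
OCF_SUCCESS=0
OCF_NOT_RUNNING=7

# pidfile_status (hypothetical): $1 = path of the PID file.
pidfile_status() {
    [ -f "$1" ] || return $OCF_NOT_RUNNING
    pid=$(cat "$1" 2>/dev/null)
    # Empty or non-numeric content is treated as "not running".
    case "$pid" in
        ''|*[!0-9]*) return $OCF_NOT_RUNNING ;;
    esac
    # kill -0 sends no signal; it only checks that the process can be signalled.
    if kill -0 "$pid" 2>/dev/null; then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}

# Usage: a PID file that contains this shell's own PID counts as running.
tmp_pid_file=$(mktemp)
echo $$ > "$tmp_pid_file"
pidfile_status "$tmp_pid_file" && echo "running"
```

This still cannot tell whether the PID has been reused by an unrelated process, which is one reason asking the service manager is the more robust choice.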
validate-all
It does nothing in particular.
sample_validate() {
    return $OCF_SUCCESS
}
The service file does nothing special: it just starts the daemon, loading the environment variables from the environment file that the resource agent writes in its start action.
/lib/systemd/system/sample-tcp-server.service
[Unit]
Description=Sample TCP Server
[Service]
Type=simple
ExecStartPre=/bin/touch /var/run/sample-tcp-server/env
EnvironmentFile=/var/run/sample-tcp-server/env
ExecStart=/usr/bin/tcp-server.py
ExecStop=/usr/bin/pkill -F /var/run/sample-tcp-server/tcp-server.pid
[Install]
WantedBy=multi-user.target
OCF resource agents are surprisingly easy to create. As a starting point, this article built a working resource agent with only the required actions, but there are a number of finer points to watch for when writing a real one. The OCF Resource Agent Developer's Guide contains many hints, so I think it will be helpful.
When I was studying for LPIC304, I didn't really understand what Pacemaker was doing. While looking into various things I realized that I could write my own resource agent, and that is how this article came about.