This may be an abrupt question, but how do you manage the jobs you run on a regular basis?
Even in 2018, when serverless is all the rage, some of you are probably still using good old cron.
In this article, I would like to suggest a way to replace a server running cron-based batch jobs with zero downtime, and to explain why it works by taking a quick look at the cron source code. (For time reasons, I only cover cronie on CentOS 7.)
In real operation you should design a proper job management system, but this article introduces a quick-and-dirty method.
TL; DR
Prepare the new server with crond stopped, then stop crond on the old server and start crond on the new server.
cron is a time-based job scheduler used on UNIX-like operating systems and is well suited to scheduling regularly executed tasks. It is used both for system administration and for running the tasks of real services.
For example, if you register a job like the following with crontab -e, the aggregation batch will be executed at minute 0 of every hour.
0 * * * * /bin/bash -l -c '/home/vagrant/bin/execute_hourly_aggregation'
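Incidentally, you can review what is currently registered for a user at any time:

# List the current user's crontab (system-wide entries live under /etc/cron.d and /etc/crontab)
crontab -l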
For more information, check out the Wikipedia article below and the various documents it references.
https://ja.wikipedia.org/wiki/Crontab
The proposed method aims at a zero-downtime replacement in an environment where some batch is almost always running, for example long-running batches or a large number of them.
Note, however, that it cannot achieve strictly zero downtime on a server where new jobs are kicked every single minute.
I will explain using the following environment, set up with Vagrant. The installed version differs slightly from the source revision explained later, but there are no major changes, so the mechanism should carry over.
[vagrant@localhost ~]$ cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)
[vagrant@localhost ~]$ yum list installed | grep cron
Failed to set locale, defaulting to C
cronie.x86_64 1.4.11-19.el7 @anaconda
cronie-anacron.x86_64 1.4.11-19.el7 @anaconda
crontabs.noarch 1.11-6.20121102git.el7 @anaconda
First, let's prepare a new server. Its file layout can be the same as the old server's; what matters is that it satisfies the following two conditions: the same jobs are registered in its crontab, and crond is not running yet.
For example, on CentOS 7 you can stop crond with the service command.
[vagrant@localhost ~]$ sudo service crond stop
Redirecting to /bin/systemctl stop crond.service
[vagrant@localhost ~]$ service crond status
Redirecting to /bin/systemctl status crond.service
● crond.service - Command Scheduler
Loaded: loaded (/usr/lib/systemd/system/crond.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Tue 2018-09-11 17:44:33 UTC; 11s ago
Process: 3406 ExecStart=/usr/sbin/crond -n $CRONDARGS (code=exited, status=0/SUCCESS)
Main PID: 3406 (code=exited, status=0/SUCCESS)
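The other condition, registering the same jobs on the new server, could be handled roughly like this. This is only a sketch: old-server and new-server are placeholder host names, and any system crontabs under /etc/cron.d would need the same treatment.

# Copy the user's crontab from the old server and register it on the new one
ssh old-server 'crontab -l' > /tmp/old.cron
scp /tmp/old.cron new-server:/tmp/old.cron
ssh new-server 'crontab /tmp/old.cron'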
Time the switchover for a window in which no new cron jobs will be kicked.
What matters here is not when the running jobs finish, but when the next job will be kicked. Check the processes hanging under crond with something like ps aux f and confirm that the last batch on the old server has already been kicked.
root 3492 0.0 0.3 26096 1704 ? Ss 17:45 0:00 /usr/sbin/crond -n
root 3921 0.0 0.4 82144 2488 ? S 18:05 0:00 \_ /usr/sbin/CROND -n
vagrant 3924 0.0 0.3 12992 1568 ? Ss 18:05 0:00 | \_ /bin/bash -l -c echo "start at $(date)" && sleep 120 && echo "end at $(date)"
vagrant 3937 0.0 0.0 7764 352 ? S 18:05 0:00 | | \_ sleep 120
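One low-tech way to catch that moment is to keep an eye on the process tree until the last scheduled batch appears under crond (the grep pattern matches the daemon command line shown above):

# Refresh the crond process tree every 5 seconds; stop watching once the
# last batch you are waiting for shows up as a child of crond
watch -n 5 'ps aux f | grep -A 2 "[c]rond -n"'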
Stop crond on the old server and start crond on the new server
Finally, we move on to the replacement itself.
First, stop crond on the old server and start crond on the new server.
At this point, the old server will no longer kick new batches.
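Concretely, the switchover is nothing more than the following two commands, run on the respective servers:

# On the old server: stop the daemon so no new jobs are kicked
sudo service crond stop
# On the new server: start the daemon so it takes over from the next scheduled time
sudo service crond start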
crond has stopped, but the jobs it already kicked are still running, so wait for those batch processes to complete. At this point the jobs kicked by crond have lost their parent and now hang under /usr/sbin/CROND.
root 4199 0.0 0.4 82144 2488 ? S 18:18 0:00 /usr/sbin/CROND -n
vagrant 4201 0.0 0.3 12992 1564 ? Ss 18:18 0:00 \_ /bin/bash -l -c echo "start at $(date)" && sleep 120 && echo "end at $(date)"
vagrant 4214 0.0 0.0 7764 352 ? S 18:18 0:00 | \_ sleep 120
Wait until the last batch process completes, and then shut the old server down gracefully.
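If you would rather script the wait than keep running ps by hand, a loop like this should work, assuming nothing else on the old server runs under /usr/sbin/CROND:

# Poll until no orphaned CROND wrappers (and the jobs under them) remain
while pgrep -f '/usr/sbin/CROND' > /dev/null; do
    sleep 10
done
echo "all cron jobs have finished; the old server can now be shut down"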
Why is this possible?
Let's take a quick look at the cron source code and recheck the state of the processes at runtime.
First, crond keeps looping over the following while block until it receives a SIGINT or SIGTERM signal.
while (!got_sigintterm) {
    int timeDiff;
    enum timejump wakeupKind;

    /* ... wait for the time (in minutes) to change ... */
    do {
        cron_sleep(timeRunning + 1, &database);
        set_time(FALSE);
    } while (!got_sigintterm && clockTime == timeRunning);
    if (got_sigintterm)
        break;
    timeRunning = clockTime;

    /* ... omitted ... */

    handle_signals(&database);
}
https://github.com/cronie-crond/cronie/blob/40b7164227a17058afb4f3d837ebb3263943e2e6/src/cron.c#L354-L481
In other words, this check runs roughly once a minute, and that is the moment at which new batches get kicked.
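You can observe this rhythm from the outside as well: on CentOS 7, crond logs every job it kicks to /var/log/cron, and the CMD lines only ever appear at minute boundaries.

# Show the most recently kicked jobs (log path is the CentOS 7 default)
sudo grep CMD /var/log/cron | tail -n 5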
A new batch is discovered by find_jobs() and is executed as a grandchild process of the original crond, via job_runqueue(), do_command(), and child_process().
root 3492 0.0 0.3 26096 1704 ? Ss 17:45 0:00 /usr/sbin/crond -n
root 3921 0.0 0.4 82144 2488 ? S 18:05 0:00 \_ /usr/sbin/CROND -n
vagrant 3924 0.0 0.3 12992 1568 ? Ss 18:05 0:00 | \_ /bin/bash -l -c echo "start at $(date)" && sleep 120 && echo "end at $(date)"
vagrant 3937 0.0 0.0 7764 352 ? S 18:05 0:00 | | \_ sleep 120
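For reference, the process tree above came from a throwaway test job along these lines (the schedule here is just an assumption; any long-running command will do):

# Throwaway test job: log a start line, sleep two minutes, log an end line
* * * * * /bin/bash -l -c 'echo "start at $(date)" && sleep 120 && echo "end at $(date)"'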
So what happens if you stop crond with SIGTERM at this point?
Let's look at the cleanup code that runs at the very end.
#if defined WITH_INOTIFY
    if (inotify_enabled && (EnableClustering != 1))
        set_cron_unwatched(fd);

    if (fd >= 0 && close(fd) < 0)
        log_it("CRON", pid, "INFO", "Inotify close failed", errno);
#endif

log_it("CRON", pid, "INFO", "Shutting down", 0);
(void) unlink(_PATH_CRON_PID);
return 0;
https://github.com/cronie-crond/cronie/blob/40b7164227a17058afb4f3d837ebb3263943e2e6/src/cron.c#L482-L495
Did you notice? Instead of wait()ing for or kill()ing its child and grandchild processes, the parent simply exits quietly, entrusting the grandchildren's handling to the children.
root 4199 0.0 0.4 82144 2488 ? S 18:18 0:00 /usr/sbin/CROND -n
vagrant 4201 0.0 0.3 12992 1564 ? Ss 18:18 0:00 \_ /bin/bash -l -c echo "start at $(date)" && sleep 120 && echo "end at $(date)"
vagrant 4214 0.0 0.0 7764 352 ? S 18:18 0:00 | \_ sleep 120
As a result, no new jobs are kicked, while jobs that have already been kicked can run to completion. (It is a little sad for the orphaned processes, though.)
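You can reproduce this behavior by hand by sending SIGTERM directly to the daemon while a job is running. The pid file path below is the usual CentOS 7 location of the _PATH_CRON_PID seen in the code above.

# Send SIGTERM straight to the crond daemon only
sudo kill -TERM "$(cat /var/run/crond.pid)"
# The daemon exits, but the CROND wrapper and the job it kicked live on
ps aux f | grep -A 2 '[C]ROND'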
Because crond is implemented so simply, it can be replaced with zero downtime even with a crude method like the one proposed here. Of course, in real operation this is an area that deserves a properly built job management system, but I hope this has made you a little more curious about the familiar old cron.