Batch Command Relies on Load Being Low Enough to Run
I monitor the activities of a number of servers. Each one sends me a daily log report, and special jobs such as backups also e-mail me when they finish, showing what was done.
Today a strangeness caught my eye.
/root/bin/single-roller1.sh georgia
mv: cannot stat `5/georgia': No such file or directory
mv: cannot stat `4/georgia': No such file or directory
mv: cannot stat `3/georgia': No such file or directory
mv: cannot stat `2/georgia': No such file or directory
mv: cannot stat `1/georgia': No such file or directory
mv: cannot stat `0/georgia': No such file or directory
job 578 at 2009-03-14 01:28
Each of these subdirectories should have held a copy of the day's backup (actually a bunch of linked entries for files that have not changed, plus real copies of those that have), and they were empty! We were down to 5 incrementals when we should have had 10. Not good.
I should explain that here, as with many of my customers, we keep online backups of all critical files. They're not on tape; they're on real disks, in this case an external 1 TB eSATA drive.
After each backup finishes, a batch job is queued. This lets the system queue several sets of copy commands and execute only one at a time, so the machine isn't overloaded. The jobs are started by the atd daemon, which is itself started at boot time.
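As a rough illustration of that queuing step, here is a minimal sketch. The helper name queue_batch is made up for this example, and it assumes the at package (which supplies batch, atq and atd) is installed; it is not the actual backup script.

```shell
#!/bin/sh
# Hypothetical helper: queue a command line as a batch job instead of
# running it immediately. Assumes batch (from the at package) is installed.
queue_batch() {
    # batch reads the commands to execute from standard input
    printf '%s\n' "$*" | batch
}

# Example, using the script named in the report above:
#   queue_batch /root/bin/single-roller1.sh georgia
# Afterwards, atq lists the jobs still waiting to run.
```

Because the job only goes into the queue, the calling script returns right away; atd decides later when the load is low enough to start it.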
In many versions of Linux “out of the box”, atd will not run any batch job while the load factor is above 0.8. This is far too low for multi-core machines such as this one, which has 4 cores (dual Xeon, each of which looks like 2 cores), so I had set the value to 3, a conservative one less than the number of cores.
But this particular machine runs a lot of non-interactive processes, mostly mail relays. The load on it can and does stay high, and it had evidently remained above 3 for the past 5 days: atq showed over 40 batch jobs in the queue waiting to run. They had accumulated over the course of the week, which led to the log e-mail with the results above.
Most of the systems I look after are some form of Red Hat Linux (real, development or copy: RH, FC, CentOS), so their setups are all similar. Recently the /etc/init.d/atd script was changed to read a configuration file, /etc/sysconfig/atd, which contains a few comments and an example:
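A quick way to spot this situation is to compare the current load with the limit and count the waiting batch jobs. The sketch below is only an illustration: the function names are invented, it assumes a Linux /proc/loadavg, and it assumes the 1-minute figure is the one that matters to atd.

```shell
#!/bin/sh
# Succeed (exit 0) if load average $1 is below limit $2.
load_below() {
    awk -v load="$1" -v limit="$2" 'BEGIN { exit !(load < limit) }'
}

# 1-minute load average: the first field of /proc/loadavg on Linux.
current_load() {
    cut -d' ' -f1 /proc/loadavg
}

# Count jobs in atq-style output read from stdin (one job per line).
count_jobs() {
    grep -c .
}

# On a live system (atq -q b shows only the batch queue):
#   load_below "$(current_load)" 3 ||
#       echo "load too high; $(atq -q b | count_jobs) batch jobs waiting"
```

Run from cron or from the daily report, a check like this would flag a growing queue well before a week of backups piles up.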
# specify additional command line arguments for atd
#
# -l Specifies a limiting load factor, over which batch jobs should not be run, instead of the compile-time
# choice of 0.8. For an SMP system with n CPUs, you will probably want to set this higher than n-1.
#
# -b Specifiy the minimum interval in seconds between the start of two batch jobs (60 default).
#example:
OPTS="-l 5 -b 120"
If this file does not exist, you'll probably have to edit /etc/init.d/atd directly and add “-l 5” and “-b 120” to the command line that starts atd (the latter only if you want batch jobs started 120 seconds apart; I don't).
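After changing the options it is worth confirming that the running daemon actually picked them up. This is a small sketch with a made-up helper, load_limit_of, that pulls the -l value out of an atd command line; it assumes the flag and its value are separate words, and the live-system lines assume a SysV-style init script and the procps version of ps.

```shell
#!/bin/sh
# Print the value that follows -l in an atd command line passed as arguments.
load_limit_of() {
    while [ $# -gt 0 ]; do
        if [ "$1" = "-l" ] && [ $# -ge 2 ]; then
            printf '%s\n' "$2"
            return 0
        fi
        shift
    done
    return 1   # no -l flag found: atd is using its compiled-in default
}

# On a live system:
#   /etc/init.d/atd restart
#   load_limit_of $(ps -C atd -o args=)
```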
The batch command, along with its brother at, is one of those “forgotten” commands, a casualty of the move away from the command line. Essentially the batch command is the at command in disguise. It simply takes the commands you feed it and schedules them to run “at some time in the future when the load is lower than X” (X being the load you set when atd is started with the -l flag), running only one batch job at a time.
There are lots of ways to run things “in the background” on Linux. at and batch are two I use all the time for general maintenance and especially for backups. In this case my initial setup didn't take into account the “normal” load of the machine (which hovers around 6 or more most of the time), so things broke. That's why I read the daily reports, every day!
