Monload2 - Linux load monitor - monload2.tgz

Example context

We have several Linux boxes that are shared among our users (mainly students). Users access them through X Windows and run a variety of X software on them. In laboratory lessons these same hosts are used for programming. The clients are all served by 5 Linux servers.

The trouble

Linux gives us a very stable platform, but we were getting reports of systems "hanging" occasionally: they keep running, but become VERY slow. After checking what was happening we found two causes, both due to bad programming: a runaway process that eats CPU time or memory, and a program that forks without end.


Here are the two nasty C code samples:

DO NOT USE THIS CODE EXCEPT FOR TESTING PURPOSES. THE SECOND WILL REALLY HANG THE SYSTEM, ESPECIALLY IF LAUNCHED IN THE BACKGROUND.

The goal of this software is to resolve these problems automatically, so that no human intervention is required, and if possible to avoid a cold reboot. In the case of an infinite fork program, probably no one will be able to log in, so human intervention in these cases just means turning the power off and on. If no administrator is around, the system may be unusable for hours.


The monitor program - how it works

We addressed the high-load mad process problem first, and that was easy. From time to time we run the ps command to check the CPU and memory usage of each user process. If it is above a predetermined level we register the process as mad. If a process stays mad for a number of iterations we put it on low priority; next we send it the TERM signal, and if it is still alive after that we send it the KILL signal. To keep system processes out of this, root processes are ignored, and you may also give usernames to be ignored on the command line.
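
The escalation described above can be pictured as a small state machine. What follows is only an illustrative sketch, not the code from monload2.c: the limits (CPU_LIMIT, MEM_LIMIT, STRIKES_BEFORE_NICE), the struct proc_sample layout and the function names are all hypothetical, and resetting the counter when a process behaves again is an assumption.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical limits; the real settings live in monload2.c. */
#define CPU_LIMIT 80            /* percent CPU above which a process is mad */
#define MEM_LIMIT 50            /* percent memory above which it is mad */
#define STRIKES_BEFORE_NICE 3   /* consecutive mad samples before acting */

enum action { ACT_NONE, ACT_RENICE, ACT_TERM, ACT_KILL };

/* One process as seen in one ps sample (hypothetical layout). */
struct proc_sample {
    const char *user;
    int cpu;      /* percent */
    int mem;      /* percent */
    int strikes;  /* consecutive samples in which it was mad */
};

/* Root is always ignored, plus any username given on the command line. */
static int is_ignored(const char *user, char *const ignored[], int n)
{
    if (strcmp(user, "root") == 0)
        return 1;
    for (int i = 0; i < n; i++)
        if (strcmp(user, ignored[i]) == 0)
            return 1;
    return 0;
}

/* Update the strike counter for one sample and pick the next step. */
static enum action next_action(struct proc_sample *p,
                               char *const ignored[], int n)
{
    assert(p != NULL && p->user != NULL);
    if (is_ignored(p->user, ignored, n))
        return ACT_NONE;
    if (p->cpu <= CPU_LIMIT && p->mem <= MEM_LIMIT) {
        p->strikes = 0;          /* assumption: good behaviour resets it */
        return ACT_NONE;
    }
    p->strikes++;
    if (p->strikes < STRIKES_BEFORE_NICE)
        return ACT_NONE;         /* mad, but not for long enough yet */
    if (p->strikes == STRIKES_BEFORE_NICE)
        return ACT_RENICE;       /* first: low priority */
    if (p->strikes == STRIKES_BEFORE_NICE + 1)
        return ACT_TERM;         /* then: ask it to exit */
    return ACT_KILL;             /* finally: force it */
}
```

The caller would run next_action once per process on every ps pass and carry out the returned step (for example with setpriority or kill), passing the usernames from the command line as the ignored list.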

The second problem was much harder to solve. The infinite fork puts such a load on the system that all I/O freezes. Using the code written for the first problem, we noticed that the ps execution itself would freeze, so we had to find a way to read the output of the ps command with a timeout. Nothing obvious seemed to work here; for example, the select system call did not help. The solution in the end was to set an alarm just before running ps: if the read is not finished when the alarm arrives, the ALARM handler sends a KILL to the ps process. This worked out well; signals seem to keep working under these conditions.

Next we needed a way to shut down the system. Just calling shutdown or reboot is no solution: even if it works it will take some hours. The infinite fork processes must be killed first, but you can't get their pids, and even if you could, they are always forking. The solution is rather harsh: we first send TERM to all possible pids starting from 100, and then we do the same with the KILL signal. This kills all the bad processes (and others), but then we can shut down and reboot the system. If this fails we still try to do a cold reboot.
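
The pid sweep can be sketched as below. This is a hedged illustration rather than the monload2 source: the names sweep, kill_everything and dry_run_send are hypothetical, skipping the monitor's own pid is an assumption (it has to survive to do the reboot), and the upper pid bound is left to the caller. The sender is passed in as a function pointer so the loop can be tried with the harmless dry_run_send before it is ever pointed at the real kill.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

#define FIRST_PID 100   /* from the text: the sweep starts at pid 100 */

typedef int (*send_fn)(pid_t pid, int sig);

/* Harmless stand-in for kill(): only counts what would have been sent. */
static long dry_run_count;
static int dry_run_send(pid_t pid, int sig)
{
    (void)pid;
    (void)sig;
    dry_run_count++;
    return 0;
}

/* Send sig to every pid in [FIRST_PID, last_pid] except self.
   Returns how many sends reported success. */
static long sweep(int sig, pid_t last_pid, pid_t self, send_fn send)
{
    long hits = 0;

    assert(send != NULL);
    for (pid_t pid = FIRST_PID; pid <= last_pid; pid++) {
        if (pid == self)
            continue;           /* assumption: spare the monitor itself */
        if (send(pid, sig) == 0)
            hits++;
    }
    return hits;
}

/* TERM to everything first, then KILL to whatever is left. */
static void kill_everything(pid_t last_pid, send_fn send)
{
    pid_t self = getpid();

    sweep(SIGTERM, last_pid, self, send);
    sweep(SIGKILL, last_pid, self, send);
}
```

A real run would be kill_everything(32767, kill), where 32767 assumes the traditional Linux pid limit; on a given system the actual limit may differ. Passing dry_run_send instead makes it safe to experiment on a machine you care about.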

How to Install

Changing settings in monload2.c:

References to system load mean "average system load over the past minute", read from /proc/loadavg; it is the same value shown by top or uptime. ALL SYSTEM LOAD VALUES ARE MULTIPLIED BY 100 (to avoid floating point calculations). Rules the settings must respect (this is mandatory, no internal checking is done):
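
To make the "multiplied by 100" convention concrete, here is a small sketch that turns the first field of /proc/loadavg into an integer using no floating point at all, so a load of 1.27 becomes 127. The function names parse_load_x100 and read_load_x100 are hypothetical; monload2.c may do this differently.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Parse the first field of a /proc/loadavg line, e.g. "1.27 0.80 ...",
   into load * 100 using integer arithmetic only.
   Returns -1 if the line does not start with a digit. */
static int parse_load_x100(const char *line)
{
    int whole = 0, frac = 0, digits = 0;

    assert(line != NULL);
    if (!isdigit((unsigned char)*line))
        return -1;
    while (isdigit((unsigned char)*line))
        whole = whole * 10 + (*line++ - '0');
    if (*line == '.') {
        line++;
        /* Keep at most two decimals; further digits are dropped. */
        while (isdigit((unsigned char)*line) && digits < 2) {
            frac = frac * 10 + (*line++ - '0');
            digits++;
        }
        if (digits == 1)
            frac *= 10;         /* "3.5" means 3.50 */
    }
    return whole * 100 + frac;
}

/* Read the one-minute load average from /proc/loadavg, times 100.
   Returns -1 if the file cannot be read. */
static int read_load_x100(void)
{
    char line[128];
    int load = -1;
    FILE *f = fopen("/proc/loadavg", "r");

    if (f == NULL)
        return -1;
    if (fgets(line, sizeof line, f) != NULL)
        load = parse_load_x100(line);
    fclose(f);
    return load;
}
```

With this convention a setting such as "act when the load is above 5" would be written as 500 in the source.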

After changing settings you must recompile, reinstall and restart. Setting these parameters for a particular host must be based on experience, since it depends on the purpose of the host. Start by using the values in the original source.

Starting the monitor

To start it at boot, find the appropriate rc file (for example /etc/rc.d/rc.local) and add the lines for monload2 there, something like:

echo "Starting monload2"
/usr/local/sbin/monload2 bin nobody wwwrun daemon

Monload2 puts itself in the background and uses syslog to record all of its activity.