Wednesday, March 27, 2013

SIGALRM Timers and Stdin Analysis

http://www.linuxjournal.com/content/sigalrm-timers-and-stdin-analysis


 It's not hard to create functions to ensure that your script doesn't run forever. But what if you want portions to be timed while others can take as long as they need? Not so fast, Dave explains in his latest Work the Shell.
In an earlier article, I started building out a skeleton script that would have the basic functions needed for any decent shell script you might want to create. I started with command-line argument processing with getopts, then explored syslog and status logging in scripts. Finally, I ended that column by talking about how to capture signals like Ctrl-C and invoke functions that can clean up temp files and so on before actually giving up control of your shell script.
This time, I want to explore a different facet of signal management in a shell script: having built-in timers that let you specify an allowable quantum of time for a specific function or command to complete with explicit consequences if it hangs.
When does a command hang? Often when you're tapping into a network resource. For example, you might have a script that looks up definitions by handing a query to Google via curl. If everything's running fine, it'll complete in a second or two, and you're on your way.
But if the network's off-line or Google's having a problem or any of the million other reasons that a network query can fail, what happens to your script? Does it just hang forever, relying on the curl program to have its own timeout feature? That's not good.

Alarm Timers

One of the most common alarm timer approaches is to give the entire script a specific amount of time within which it has to finish by spawning a subshell that waits that quantum, then kills its parent. Yeah, kinda Oedipal, but at least we're not poking any eyes out in this script!
The additional lines end up looking like this:

(
sleep 600           # if 10 minutes pass
kill -TERM $$       # send it a SIGTERM signal
)&

There's no "trap" involved—easy enough. Notice especially that the closing parenthesis has a trailing ampersand to ensure that the subshell is pushed into the background and runs without blocking the parent script from proceeding.
A smarter, cleaner way to do this would be for the timer child subshell to send the appropriate SIGALRM signal to the parent—a small tweak:

(
sleep 600            # if 10 minutes pass
kill -ALRM $$        # send it a SIGALRM signal
)&

If you do that, however, what do you need in the parent script to capture the SIGALRM? Let's add that, and let's set up a few functions along the way to continue the theme of useful generic additions to your scripts:

function allow_time
{
   ( echo timer allowing $1 seconds for execution
     sleep $1
     kill -ALRM $$
   ) &
}

This first function lets you easily set a time for subsequent execution, while the second presents your ALRM handler in a bit neater fashion:

function timeout_handler
{
   echo allowable time for execution exceeded.
   exit 1
}

Note that both functions print output that's probably not needed for actual production code. It's easily commented out, but running it as is will help you understand how things interact and work together.
How might this be used? Like this:

trap timeout_handler SIGALRM
allow_time 10
code that has ten seconds to complete

That would give the script ten seconds to finish.
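One caveat worth checking on your own system: bash runs a trap handler only after the foreground command it is waiting on has finished, so a command that truly hangs would delay timeout_handler right along with it. The sketch below is one possible way around that, not something from the original column. It runs the slow command in the background and blocks in wait, which a trapped signal can interrupt. The temporary file, the work_pid variable, the one-second limit and the sleep 5 stand-in for a hung command are all assumptions made for the demo, and the debugging echo in allow_time is left out to keep the output short.

```shell
# Write the demo to its own file so that $$ inside it is the demo script's PID.
demo=$(mktemp)
cat > "$demo" <<'EOF'
function allow_time
{
   ( sleep $1
     kill -ALRM $$
   ) &
}

function timeout_handler
{
   echo allowable time for execution exceeded.
   kill $work_pid 2> /dev/null    # stop the stalled command as well
   exit 1
}

trap timeout_handler SIGALRM
allow_time 1
sleep 5 &            # stand-in for a command that hangs (assumption)
work_pid=$!
wait $work_pid       # unlike a foreground command, wait can be interrupted
echo finished in time
EOF

output=$(bash "$demo")
status=$?
rm -f "$demo"
echo "$output (exit status $status)"
```

Because the stand-in command outlasts the one-second limit, the handler fires: the captured output is the handler's message, the exit status is 1, and the final echo inside the demo is never reached.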
The problem is, what happens if it finishes up in less time than allotted? The subshell is still out there, waiting, and it pushes out the signal to a nonexistent process, causing the following sloppy error message to show up:

sigtest.sh: line 7: kill: (10532) - No such process

There are two ways to fix this: either kill the subshell when the parent shell exits, or have the subshell test for the existence of the parent shell just before it sends the signal.
Let's do the latter. It's easier, and having the subshell float around for a few seconds in a sleep is certainly not going to be a waste of computing resources.
The easiest way to test for the existence of a specified process is to use ps and check the return code, like this:

ps $$ >/dev/null ; echo $?

If the process exists, the return code will be 0. If it's gone, the return code will be nonzero. This suggests a simple test:

if [ ! $(ps $$ > /dev/null) ]

But that won't work, because the brackets evaluate whatever the command prints (here nothing, since the output goes to /dev/null), not its return code. The solution? Simply invoke the ps command and have the if statement act on its return code:

function allow_time
{
   ( echo timer allowing $1 seconds for execution
     sleep $1
     if ps $$ > /dev/null ; then   # return code 0: parent is still running
       kill -ALRM $$
     fi
   ) &
}
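To see why the bracketed form shown earlier cannot tell a live process from a dead one, it may help to try it with two commands whose return codes are known in advance. This is only an illustrative sketch: true and false stand in for ps, and the variable names are made up for the demo. In both bracketed tests the substitution prints nothing, so the expression collapses to [ ! ], which succeeds simply because "!" is a non-empty string.

```shell
# Bracketed form: tests the (empty) captured output, not the return code.
if [ ! $(false > /dev/null) ] ; then bracket_false=passed ; fi
if [ ! $(true > /dev/null) ] ; then bracket_true=passed ; fi
echo "bracket test after false: $bracket_false, after true: $bracket_true"

# Handing the command itself to if tests its return code instead.
if true ; then direct_true=ran ; else direct_true=skipped ; fi
if false ; then direct_false=ran ; else direct_false=skipped ; fi
echo "direct test of true: $direct_true, of false: $direct_false"
```

The bracketed tests pass both times, while the direct form runs its branch only for the command that succeeded.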

That solves that problem. But, what if you have sections of code where you want to limit your execution time followed by other sections where you don't care?
That's easy if you don't mind leaving some child processes around waiting to shoot a signal at the parent. Just use this:

trap '' SIGALRM

when you're done with the timed passage. What happens is that the timer generates a signal, but the parent script ignores it.
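A quick way to convince yourself of this is to send the signal by hand. The sketch below is an assumption-laden demo rather than part of the column: it runs each experiment in a child bash so that $$ refers to that child and not to your login shell, and the ignored and status variable names are made up.

```shell
# With the empty trap in place, SIGALRM is ignored and the child carries on.
ignored=$(bash -c '
   trap "" SIGALRM       # from here on, SIGALRM is ignored
   kill -ALRM $$         # stand-in for a stale timer firing
   echo still running
')
echo "with the empty trap: $ignored"

# Without the trap, the default action for SIGALRM ends the child.
bash -c 'kill -ALRM $$ ; echo not reached' 2> /dev/null
status=$?
echo "without it: exit status $status"
```

The second child never reaches its echo, and its exit status is above 128, which is how the shell reports a process that was ended by a signal.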
The limitation on this, of course, is if you have code like this:

regular code
possible runaway code <-- allocate 100 seconds
regular code <-- cancel timer
more possible runaway code <-- allocate 100 seconds

The situation arises if the second code block is started before the first timer runs out. Imagine that you've allocated 100 seconds for the first timed block and it finishes in 90 seconds. Regular code takes five seconds, and then you're in block two. Five seconds later, the first ALRM timer triggers, cutting block two off after five seconds rather than the 100 it was promised. Not good.
This is admittedly a bit of a corner case, but to fix it, let's reverse the earlier decision. Instead of having child processes test for the existence of the parent before sending the signal, have the parent script kill all child subshells upon completion of the timed portion. It's a bit tricky to build, because it relies on ps, which lists more processes than just that subshell. You need to keep only the processes running under the script's own name, then screen out the parent script's own process ID.
I use the following:

ps -g $$ | grep $myname | awk '{print $1}' | grep -vx $$

This generates a list of process IDs (pids) for all the subshells running, which you then can feed to kill:

pids=$(ps -g $$ | grep $myname | awk '{print $1}' | grep -vx $$)
kill $pids

The problem is that not all of those processes are still around by the time they're handed to the kill program. The solution? Ignore any errors generated by PID not found:

kill $pids > /dev/null 2>&1

Combined as a function, it'd look like this:

function kill_children
{
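   myname=$(basename "$0")    # the script's name as it appears in ps output
   # PIDs in this process group whose command matches the script, minus our own
   pids=$(ps -g $$ | grep "$myname" | awk '{print $1}' | grep -vx $$)
   kill $pids > /dev/null 2>&1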
}
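If parsing ps output feels fragile, one alternative is to have allow_time remember the process ID of the timer it just launched, which the shell makes available in $!, and to kill exactly that process once the timed portion is done. This is a different technique from the one above, offered only as a sketch: the timer_pid variable and the cancel_timer function are hypothetical names, and the two-second limit is chosen so the demo's leftover sleep disappears quickly.

```shell
allow_time() {
   ( sleep "$1"
     kill -ALRM $$ 2> /dev/null
   ) &
   timer_pid=$!          # PID of the timer subshell just started
}

cancel_timer() {
   if [ -n "$timer_pid" ] ; then
      kill "$timer_pid" 2> /dev/null   # stop the timer before it can fire
      wait "$timer_pid" 2> /dev/null   # reap it quietly
      timer_pid=
   fi
}

allow_time 2
echo "timed section finished early"
cancel_timer
echo "timer cancelled, so a later timed section starts from a clean slate"
```

The sleep inside the cancelled subshell may linger until its time is up, but with the subshell gone there is nothing left to send the signal.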

If you're thinking "holy cow, multiple timers in the same script is a bit of a mess", you're right. At the point where you need something of this nature, it's quite possible that a different solution would be a smarter path.
Further, I'm sure there are other ways to address this, in which case I'd be most interested in hearing from readers about whether you've encountered a situation where you need to have multiple timed portions of your code, and if so, how you managed it! Send e-mail via http://www.linuxjournal.com/contact.
