Remove stale Resque workers using bash

I run a Rails application that uses Resque to manage background jobs. It works beautifully. Unfortunately, Resque workers sometimes become stuck while working on a task. This can happen for a number of reasons, but the end result is that you're missing a worker and you might not realize it.

Fortunately, when Resque starts a job it forks a child process and records the time of the fork, as a Unix timestamp, in the worker's process title. You can tell whether a worker is stale by comparing this timestamp to the current time. Epoch Converter is one of a number of websites that will convert the timestamp for you manually.
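
The comparison can also be done in the shell instead of on a website. Here is a minimal sketch, assuming the process title looks like the one Resque 1.16.1 produced for me ("resque-1.16.1: Forked 17674 at 1308879414"). The function name forked_age is my own invention for illustration and is not part of Resque.

```shell
#!/bin/bash
# forked_age TITLE [NOW]
# Prints how many seconds ago the child named in TITLE was forked.
# TITLE is a process title such as "resque-1.16.1: Forked 17674 at 1308879414".
# NOW defaults to the current epoch time; passing it explicitly makes testing easy.
forked_age() {
  local title=$1
  local now=${2:-$(date +%s)}
  local ftime=${title##* at }   # drop everything up to the last " at "
  ftime=${ftime%% *}            # drop any trailing whitespace left by ps

  # Reject titles that do not end in a numeric timestamp.
  if ! [[ "$ftime" =~ ^[0-9]+$ ]]; then
    echo "no fork timestamp in: $title" >&2
    return 1
  fi
  echo $(( now - ftime ))
}

# A job forked at 1308879414 and checked at 1308880014 has been running 600 seconds.
forked_age "resque-1.16.1: Forked 17674 at 1308879414" 1308880014
```

If you only want a human-readable date, systems with GNU date can convert the timestamp directly with date -d @1308879414. Other date implementations use different flags.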

If you’re using god to monitor your processes, resque provides an example to eliminate these “stale” workers.  I don’t use god and instead use monit to monitor my processes.  Resque comes with a monit example too, but it only handles workers that consume too much memory.  If you happen to find yourself with stale workers, try using the following bash script I made.

#!/bin/bash 

# Simple script to kill stale resque workers
# Adam St. John [astjohn] [@] [gmail]
# 2011-06-24

# Timeout set to 10 minutes
TIMEOUT=600

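# Sample ps line: 21028 resque-1.16.1: Forked 17674 at 1308879414
# The grep pattern is quoted so the shell cannot glob-expand it against local files.
output=$(ps -e -o pid,command | grep -e '[r]esque.*Forked' | awk '{printf "%i#%i#%i\n", $1, $6, $4}')
now=$(date +%s)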

echo "`date` - Looking for stale workers."
# output should produce something like: 21028#1308879414#17674
if [ -z "$output" ] ; then
  # Empty output means no worker has a forked child right now
  echo "No workers are currently processing a job."
else
  # Feed the loop from a here-string rather than a pipe so that "exit 1"
  # below ends the whole script instead of just a pipeline subshell.
  while read -r line ; do

    # Split parent_pid#fork_time#child_pid into its fields
    pid=$(echo "$line" | cut -d'#' -f1)
    ftime=$(echo "$line" | cut -d'#' -f2)
    fpid=$(echo "$line" | cut -d'#' -f3)

    if ! [[ "$pid" =~ ^[0-9]+$ ]] ; then
      echo "Unable to find PID of resque worker" >&2; exit 1
    fi

    if ! [[ "$ftime" =~ ^[0-9]+$ ]] ; then
      echo "Unable to find Forked time for resque worker" >&2; exit 1
    fi

    difference=$((now - ftime))
    if [ "$difference" -ge "$TIMEOUT" ] ; then
      echo "Found a stale worker!"
      # USR1 tells the parent worker to kill its child immediately; the worker itself keeps running
      kill -s USR1 "$pid"
      echo "Kill signal USR1 sent to worker $pid so that it will shut down forked child $fpid"
    fi
  done <<< "$output"
  echo "Finished dealing with stale workers."
fi
exit 0

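The script reads the fork timestamp from each busy worker's process title and, if the job has been running for TIMEOUT seconds (10 minutes) or longer, sends the worker a USR1 signal. Resque responds to USR1 by killing the forked child immediately while the worker itself stays up and moves on to the next job. I run this script from a cron job.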
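
For reference, here is a rough sketch of how the cron entry could be put together. The schedule, script path, and log path are all hypothetical placeholders, so substitute whatever matches your own setup.

```shell
#!/bin/bash
# Hypothetical values: adjust the schedule, script path, and log path to your setup.
SCHEDULE='*/5 * * * *'                        # every five minutes
SCRIPT='/usr/local/bin/kill_stale_resque.sh'  # assumed install location of the script
LOG='/var/log/kill_stale_resque.log'          # assumed log file

# Build the crontab line; stdout and stderr are both appended to the log.
CRON_LINE="$SCHEDULE $SCRIPT >> $LOG 2>&1"
echo "$CRON_LINE"

# To install it without clobbering existing entries, something like this should work:
#   ( crontab -l 2>/dev/null; echo "$CRON_LINE" ) | crontab -
```

Checking more often than TIMEOUT keeps the delay before a stuck job is noticed short. With a 10 minute timeout and a 5 minute schedule, a stale job is caught at most about 15 minutes after it was forked.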
