How I kill a process with suspicious TCP CLOSE_WAIT
During our server-side application development, we encontered a lot of connections are in CLOSEWAIT state, so that our server process is out of file descriptors. We are in the middle of development of a client application that runs in the mobile androids, and the server-side application that runs in a cloud infrastrure.
I'm in the server-side team, and our team is focusing on the development of server-side. Our server-side have multiple front-end server that expose the interface for the clients. Front-end servers are like load-balancers, they dispatch the client requests to the one of the several back-end workers. Since we're in the middle of the development, our front-end servers and back-end servers have a couple of bugs in them. They sometimes made the server crash, even hang unpredictively.
Unfortunately, while we were tring to stablize our server applications, the client team needed a prototype server cluster, so that they can develop their client application and test the interaction between client and the front-end. Personally, I don't want to provide our prototype servers to the client team until the server-side is stablized, but the client team also need to hurry, to meet the dead-line, so we have no choice but to provide them still-unstable-servers.
The biggest problem was, the server application leaves CLOSE_WAIT
state TCP connections on unexpected network situation. So, after a
couple of hours, the server process ran out of file descriptors,
denying client requests. Since we use sophiscated third-party network
library, it would take some times to fix the problem.
So, I need some kind of watchdog, which periodically check whether the
server process leaves CLOSE_WAIT
connections, and kill them, leave
some logs, and so on. Our server application is managed by init(1)
like launcher, so when the server processes are terminated, the
launcher automatically raise them.
Implementation
I was in hurry to implement this wachdog program, so I decided to write small bash script, but later changed to Ruby script. Fortunately, all of our servers already have Ruby 1.8 installed.
At some time slice, the output of the netstat(1)
would like this:
$ netstat -ntp
...
tcp 0 0 10.149.8.221:46271 54.235.151.255:6379 ESTABLISHED 16125/fe-server
tcp 0 0 10.149.8.221:46283 54.235.151.255:6379 ESTABLISHED 16118/fe-server
tcp 0 0 10.149.8.221:46267 54.235.151.255:6379 ESTABLISHED 16120/fe-server
tcp 0 0 10.149.8.221:35250 10.158.95.68:58964 CLOSE_WAIT 16063/fe-server
tcp 0 0 10.149.8.221:43557 10.147.191.96:52421 ESTABLISHED 16063/fe-server
tcp 0 0 10.149.8.221:8010 107.22.161.62:37126 CLOSE_WAIT -
...
$ _
The netstat(1)
from net-tools, accept -n option, indicates to use
numerical addresses and ports, -t options indicates to show only TCP
connections, and -p options to show the related PID and program names.
It looks trival to catch the PID of the process that has one or more
CLOSE_WAIT
connections. One thing to keep in mind is, netstat(1)
sometimes displays "-" in the PID/PROGRAM field. I don't have
enough time when netstat(1)
shows "-", but fortunately, fuser(1)
can identify the owner PID of the connection.
$ fuser -n tcp 8010
35250/tcp: 16063
$ fuser -n tcp 8010 2>/dev/null
16063$_
My first implementation was, just simply count the number of
CLOSE_WAIT
connections per process, and kill(1) $PID
if the
process has more than N CLOSE_WAIT
connections.
The limitation of the first implementation is, it may kill the
process with CLOSE_WAIT
connection that the process just about to
close(2)
it.
So the second implementation work like this:
- save the connection information (source address:port, destination address:port) per process as a set-like container
- Wait for certain amount of the time
- save the connection information again, in another set-like container.
- Get the intersection of the two set.
- If the number of elements in the intersection exceeds N, kill the process.
I couldn't come up with a good implementation of set-like container
in bash(1)
, so I re-implement from the scratch with ruby(1)
.
After few hours, another problem arised. Some server processes,
goes coma, and does not adhere to SIGTERM
. We can only kill them with
SIGKILL
, so I modified the killing line like this:
kill $pid; sleep 2; kill -0 $pid && kill -9 $pid
This line, first send SIGTERM
to the $pid, then sleep for 2
seconds, and if it still can send a signal to the process (in other
words, if the process is still alive), send SIGKILL
to the $pid.
Usage
I named the script, resreap
. The reason was, we call our server
processes as resources, so it stands for 'RESource REAPer'. The
full source code is available from here.
With some extra features, my script, called resreap
, can accept
following options:
$ ./resreap --help
Kill processes that have enough CLOSE_WAIT socket(s)
Usage: resreap [OPTION...]
-f PAT Kill only processes whose command matches PAT
-F HOST:PORT Ignore if foreign endpoint matches to HOST:PORT
HOST should be in IPv4 numerical notation.
-l N If a process has more than or equal to N CLOSE_WAIT
socket(s), it will be killed with a signal
(default: 2)
-i N Set sleep interval between checks in seconds
(default: 2)
-c CMD Before sending a signal, execute CMD in the shell,
If this CMD returns non-zero returns, the process
will not receive any signal.
-s SIG Set the signal name (e.g. TERM) that will be send
to a process (default: TERM)
-w SEC Set the waiting time in seconds between the signal and
SIGKILL (default: 2)
-d dry run, no kill
-D debug mode
-h show this poor help messages and exit
-v show version information and exit
Note that if a process receives the signal, and the process is alive
for 2 second(s), the process will receive SIGKILL.
If you are going to use "-f" option, I recommend to try "-d -D" option
first. If you get the pid of the culprit process, try to get the
command name by "ps -p PID -o command=" where PID is the pid of that
process.
You could send two signal(s) before sending SIGKILL using '-S' option.
This can be useful since some JVM print stacktrace on SIGQUIT.
$ _
For example, if you want to kill a process if it has more than 2
CLOSE_WAIT
connections, and you only care for java program, then you
can do:
$ ./resreap -l 2 -f ^java
Plus, if you want to ignore CLOSE_WAIT
connection on 127.0.0.1:2049,
you could do:
$ ./resreap -F 127.0.0.1:2049
I really hope that we don't need to use this awful script for our servers. :)
Comments
Comments powered by Disqus