What Every C Programmer Should Know About Undefined behavior

Seong-Kook Shin

2018-05-28 19:51

Comments

These are links to the three articles of Chris Lattner, posted in LLVM blog:

With Chris Lattner's permission, I translate these wonderful articles in Korean here.

Starting blog again with Nikola

Seong-Kook Shin

2018-03-03 14:06

Comments

More than 2 years, I haven't updated my site. To say long story short, I was in UK for more than one years, and unable to set up my desktop machine, which has all the sources of my site, left in a luggage during in UK. Recently, I moved to the Bay area in US, and finally set up the old desktop again, and replace the site generator from Jekyll to Nikola.

Jekyll was great, however, it requires too many modification for my need, since my site requires some static pages plus blog entries, and I wrote all my posts in Org-mode, which was not supported by Jekyll so I had to come up with my own solution that convert .org files to .html files using Emacs myself, and insert YAML configuration to all of them, and uses Jekyll to generate the final output.

Now, I read a very interesting article, Blogging in org-mode with Nikola, which introduced Nikola to me, and I decided to experiment with it. After few hours of experimentation, I was really surprised as it contains everything that I imagined before!

org mode support using Emacs instead of custom converter.
RSS feed by default
support for non-blog site
Disqus integration.
support for multi-lingual site

I dare to say, (for my own need at least), Nikola is better than vanilla Jekyll, and still better than Octopress.

Of course, there are some problems that I found already, but they are minor enough. Hopefully, I may explain those in another post.

Anyway, here I am, restarted the blog again. Hope that I can find many interesting topics soon.

Remove a node from a linked list in almost O(1)

Seong-Kook Shin

2014-08-12 00:00

Comments

struct node {
  int data;
  /* ... */
  struct node *next;
};
typedef struct node NODE;

Suppose that we have a singular list of NODE or a circular list of NODE⁠. To remove a node from the list, we'd like to implement following function:

/* Remove NPTR from the list pointed by HEAD, and return it. */
NODE *list_extract(NODE **head, NODE *nptr);

The node that we need to extract from the list is pointed by nptr⁠. So we need to make sure that the next field of the node before nptr should point the node after nptr⁠. One way to find the previous node is to iterate all nodes from head to find the node whose next field is the node pointed by nptr⁠:

NODE *before;
for (before = head; before != 0; before = before->next)
  if (before->next == nptr)
    break;

However, it may not be so efficient to traverse lots of node before nptr if the list is very very long. There is another way to extract a node from a list without iterating the list to find the previous node.

head           nptr
[10] -> ... -> A:[12] -> B:[34] -> ...

Suppose we have a list like the above. head and nptr is the pointer to the head of the node and the node that needs to be extracted respectively. Let the node points by nptr is node A, and the next node of node A is node B. Node A has an integral data, 12, and node =⁠B= has 34⁠.

Instead of finding the node before nptr⁠, we like to swap the contents of the node A and node B. To do that, I'll introduce a temporary node, T⁠, and copy the contents of node =⁠A= into it.

head           nptr
[10] -> ... -> A:[12] -> B:[34] -> ...

                           ^
                           |
               T:[12] -----+

Note that we copied every fields of node A into node T⁠, so =⁠next= field of node T also points the same node as node A points. Then, we copy all fields of node B into node A:

head           nptr
[10] -> ... -> A:[34] -------------+
                                   |
                                   V
               T:[12] -> B:[34] -> ...

Then, we copy all fields of node T into node B⁠:

head           nptr
[10] -> ... -> A:[34] -----------> ...

                            +----+
                            |    |
                            V    |
               T:[12] -> B:[12] -+

Since we copyied next field from node =⁠T= to node B⁠, node =⁠B= turns to be a circular list of its own node, which we don't intent. So set next field of node B to NULL⁠.

head           nptr
[10] -> ... -> A:[34] -----------> ...

               T:[12] -> B:[12] -|

Now the node that we wanted to extracted is copyied to node B, which is accessible by next field of node T. And the list has successfully removed the contents of node pointed by nptr⁠.

However, when the target nptr is the last node, we cannot use this trick since there is no next node. In that case, we use the well known solution, to iterate all nodes to find the previous node.

The full source code of list_extract() is here:

NODE *
list_extract(NODE **head, NODE *nptr)
{
  NODE t;
  NODE *prev;

  if (*head == nptr) {
    if (nptr->next == NULL) {
      *head = 0;
      return nptr;
    }
    else {
      *head = nptr->next;
      nptr->next = NULL;
      return nptr;
    }
  }

  if (nptr->next) {
    memcpy(&t, nptr, sizeof(t));
    memcpy(nptr, nptr->next, sizeof(t));
    memcpy(t.next, &t, sizeof(t));
    t.next->next = 0;
    return t.next;
  }

  for (prev = *head; prev != NULL; prev = prev->next) {
    if (prev->next == nptr)
      break;
  }
  if (!prev)
    return 0;

  prev->next = nptr->next;
  nptr->next = 0;

  return nptr;
}

And the time complexity will be:

\begin{eqnarray} O(f(n)) & = & \left\{ \begin{array}{l l} O(1) & \quad \text{ if $nptr$ is the head} \\ O(n - 1) & \quad \text{ if $nptr$ is the last} \\ O(1) & \quad \text{ the rest } \\ \end{array} \right. \\ &=& \frac{1}{n}O(1) + \frac{1}{n}O(n-1) + \frac{n - 2}{n}O(1) \\ &=& \frac{n - 1}{n}O(1) + \frac{1}{n}O(n) \\ &\cong& O(1) \quad \text{ if $n$ is relatively large } \end{eqnarray}

Detect staled NFS mount

Seong-Kook Shin

2014-07-02 00:00

Comments

Check stale NFS

Here's a simple script to check whether the given directory (NFS mount point) is stale.

There are three points that needs some explanation here.

First, since any command that access the NFS file system would block (unresponsive) iff the NFS is stale, I am using read -t N for the timeout.

Second, I used process substitution feature of bash, <(list) form. Basically, read -t 1 < <(...) will timeout after 1 second unless =⁠…= part finished within the timeout. bash will create a new subshell to execute /⁠list/ in <(list) form. The problem is, if any of list will access the stale NFS, the command will hang, which makes the subshell also will hang. Even if the calling shell script finished, the subshell would not terminated, leaving the process in interruptible sleep state.

To prevent this, I recorded the sub-shell PID using $BASHPID in <(list) form. After =⁠read= command, I deleted all possible children of the subshell and the subshell itself. Note that using $$ won't work in <(list) form. =⁠$$= represents the mother shell's PID, not the sub-shell's.

SSH wrapper to connect Scalr-managed servers

Seong-Kook Shin

2014-02-13 00:00

Comments

It's annoying to connect one of the virtual machine managed by Scalr for various reasons. So I created small shell script for easy connection; browse the servers with its IP address, then connect to them.

Backgrounds

We created lots of Scalr managed servers. Some of them have external IP addresses, but some of them are not. So, we need to prepare a proxy machine, to connect those servers.

<pre>

Internet Firewall Cloud IaaS

~~---------~~ ~~----~~-—+ ~~---------~~

Client		Proxy		Target
Machine		Server		Server
	----–—>		---–—>

~~---------~~ ~~----~~-—+ ~~---------~~ 53.208.160.176 10.102.9.203

</pre>

For example, suppose that we've prepared the proxy machine at 53.208.160.176. Normally, you could connect to the destination server 10.102.9.203 by issuing ssh twice, like this (which is annoying):

$ ssh 53.208.160.176
$ ssh 10.102.9.203

We are using Scalr's auto-scaling feature; this means, the number of servers are dynamically increasing/decreasing depending on the server's load. In other words, at some instance, we do not exactly know how many servers are there, and we do not exactly know what IP addresses they have.

So, I created small shell script named sssh (stands for "Scalr-ssh") to find out the list of Scalr-managed servers, and provide easy ssh connection to one of the servers. With this, you can connect a server instantly even if the server does not have external IP address.

Download & Installation

First, you'll need to download the Scalr command line tool from Scalr Command Line Tools, and you'll need to finish scalr configure step.

$ sudo easy_install scalr
$ scalr configure -i d41d8cd98f00b204 \
    -a 3bEGXWzaoT92BMhOaqv13bEGXWzaoT92BMhOaqv13bEGXWzaoT92BMhOaqv1+0=

Above example will save your configuration in $HOME/.scalr/config.ini.

Then, you'll need to grab the source from here, and save it to some place like /usr/local/bin. Then, edit that file to update the proxy endpoint in SSH_PROXY_ENDPOINT to reflect your proxy endpoint. For example:

SSH_PROXY_ENDPOINT=${SSH_PROXY_ENDPOINT:="root@53.208.160.176}

You can test whether the installation was successful via following command. Note that the actual output may vary depending on your Scalr configuration/usage.

$ sssh env
  [149]  AWS-TEST-ENV
  [158]  US-EAST-9
  [161]  AP-KR-FOOBAR

Usage

If you have more than one Scalr environment, you'll need to list the environments using sssh env, then select one of the environment with the following command:

$ # select environment with id, 158
$ sssh set-env 158

Then, you'll need to select one of your farms. First, list the farms using sssh farms, then select one of it using sssh set-farm:

$ # list the farms
$ sssh farms
  [808]                          test-vpc (Stopped):   VPC farm for testing
  [809]      ec2-us-east-1-management-dev (Running):   None
  [814]           ec2-us-east-2-store-dev (Stopped):   None
  [953]                template-test-farm (Running):   None
$ # select one of the farm
$ sssh set-farm 809

Once the env/farm is selected, then you can browse the list of servers by sssh list:

$ sssh list
  [ 0]   53.208.160.176    10.102.9.174  proxy-server
  [ 1]      53.84.9.110    10.102.9.135  zookeeper-3-centos6-cl
  [ 2]     53.84.76.146    10.102.9.146  zookeeper-3-centos6-cl
  [ 3]     53.84.65.212      10.102.9.7  zookeeper-3-centos6-cl
  [ 4]             None    10.102.9.203  sessionmgr-master-centos6-cl
  [ 5]     53.84.72.223    10.102.9.132  cs-sessionmgr-master-centos6-cl
  [ 6]     53.84.74.122     10.102.9.52  cs-sessionmgr-master-centos6-cl
  [ 7]     53.84.64.155    10.102.9.112  cs-frontend-centos6-cl
  [ 8]       53.84.0.88    10.102.9.106  cs-frontend-centos6-cl
  [ 9]             None    10.102.3.210  cs-datastore-centos6-cl

Each item contains 4 fields: the server index, the external IP, the internal IP, and the name of the server. In above example, 4th and 9th server do not have external IP. Remember that we configured SSH_PROXY_ENDPOINT to point 0-th server endpoint, "root@53.208.160.176". This server is used for the ssh proxy for this demonstration.

These servers belong to the farm id, 809 as we selected this farm using sssh set-farm 809. To connect one of these servers, you need to download the PEM file of this farm, and place it in your $HOME/.ssh/809.pem. Finally, you can connect to one of the servers by following command:

$ # connect to 1st server
$ sssh connect 1
Last login: Tue Feb 11 05:32:28 2014 from 124.168.108.138

Appliance:	centos-6-scalr appliance 1.0
Hostname:	ip-10-102-9-135
IP Address:	10.102.9.135

[root@ec2-53-84-9-110 ~]# _

You can even connect to the server without external IP. For example:

$ # connect to 9th server
$ sssh connect 9
Last login: Wed Feb 12 09:04:02 2014 from 10.102.9.174

Appliance:	centos-6-scalr appliance 1.0
Hostname:	ip-10-102-9-210
IP Address:	10.102.9.210

[root@ip-10-101-3-210 ~]# _

Note that from the first message of the command; it says that the connection was from 10.102.9.174, which is the internal IP address of the 0-th server, which is used for the ssh proxy.

Internally, when we specify a server without external IP address, sssh will indirectly connect to the server via the pre-configured ssh proxy server using ssh ProxyCommand option with netcat(1):

ssh -i "$pem" -o "ProxyCommand ssh -i $pem $SSH_PROXY_ENDPOINT nc %h %p" root@${iip}

If you have interest on this, read the nice article Configure openssh to tunnel through another server using SSH.

Easy way to resize/select windows or frames of Emacs

Seong-Kook Shin

2013-11-30 00:00

Comments

Window/Frame Selection

When you're using Emacs in a graphic display (That is, not in a console nor a terminal), you'll have multiple Emacs frames and windows in them.

The default navigation interface of Emacs is quite surprising to the non Emacs users since Emacs provides relative selection machanism.

For example, suppose you have following Emacs frames and windows:

<pre>
+-Frame A------------+   +-Frame B------------+   +-Frame C------------+
|                    |   |                    |   |                    |
| Window A           |   | Window C           |   | Window F           |
|                    |   |                    |   |                    |
|                    |   +--------------------+   |                    |
|                    |   |                    |   |                    |
+--------------------+   | Window D           |   |                    |
|                    |   |                    |   |                    |
| Window B           |   |                    |   |                    |
|                    |   +--------------------+   |                    |
|                    |   |                    |   |                    |
|                    |   | Window E           |   |                    |
+--------------------+   +--------------------+   +--------------------+
</pre>

Emacs provides basic frame selection and window selection commands; other-frame and other-window. They select the next frame or next window from the list in cyclic order. The problem is, the default order may not reflect the coordinates of the frame/window, especially when you moved some frames/windows.

Suppose that the currently selected frame is Frame B in above figure. If Emacs kept the frame list in (Frame#B Frame#A Frame#C), the next frame would be Frame A. Of course, by using negative prefix argument to other-frame function, you can select the previous frame if you want.

What I want is, to select a frame in the order of the actual coordinate of the frames. That is, I want to give a command something like, "select the frame where its X coordinate is the closest to the origin.", or "select the frame where its X coordinate is the second closest to the origin.".

So I made a simple function, wfu/frame-list-ordered, to return an ordered list of frames. Similarly, I made another function, wfu/window-list-ordered, to return an ordered list of windows. Using these two functions, I made two commands; wfu/other-frame-or-window and wfu/other-window-or-frame. wfu/other-frame-or-window will select other(next) frame. If there is no other frame, it will select other(next) window. Similarly, wfu/other-window-or-frame will select other(next) window. If there is no other window, it will select other(next) frame.

Binding a key shortcut to an Emacs command is treaky thing. Since you don't know that whether you can easily memorize new keybindings. Anyway, I found following key sequences are best for my personal use:

Keys	Description
`C-<tab>`	bound to `wfu/other-window-or-frame`, it selects the next window of the current frame.
`C-N C-<tab>`	Select the N-th window of the current frame.
`C-- C-N C-<tab>`	Select the N-th frame.
`C-x o`	bound to `wfu/other-frame-or-window`, it selects the next frame.

N is the number between 0 and 9. Note first window/frame starts from 1, not 0.

Normally, I just stick to C-<tab> to select other window. When I want to select other frame, I'll feed it a negative number; where the absolute value of the number denotes the N-th frame.

Source

You may download wfutils.el from the github.

A script to create Redis Cluster using GNU screen(1)

Seong-Kook Shin

2013-11-22 00:00

Comments

Easy way to setup the local redis replication using GNU screen(1)

One of my reponsibility is to write easy client library for Redis in C and Java. There are already well-known client C library, hiredis and Java library, jedis. At the time of development, none of these support our Redis replication cluster. I'll write later about the client libraries that support replication.

Anyway, during the development, I need to launch simple redis cluster, which consists of 1 master and 2 slaves. It is tiresome job to setup the configuration of master and slaves, and it is very likely to commit a mistake.

Thus, I wrote a small shell script (called redis-replica.sh) to launch redis cluster locally. Internally, it uses GNU screen to create multiple shell to launch required processes:

shell no.	description
0	redis-sentinel, listening 26379
1	redis monitor to the sentinel
2	redis-cli to the sentinel
3	redis-server master, listening 6379
4	redis monitor to the master
5	redis-cli to the master
6	redis-server slave#1, listening 6380
7	redis monitor to the slave#1
8	redis-cli to the slave#1
9	redis-server slave#2, listening 6381
10	redis monitor to the slave#2
11	redis-cli to the slave#2

Since the master and slaves are managed by the sentinel process, if you shutdown the master, one of the slaves will be promoted to new master.

This way, you can easily experiment and test your client software or libraries.

Here is the source code of redis-replica.sh:

How I measured Emacs init script performance

Seong-Kook Shin

2013-11-21 00:00

Comments

Background

How did I load lots of init scripts?

If you're like me, you have a lot of elisp files for your Emacs configuration. Without considering unmaintained code or junk codes, I have almost 80 elisp files in my $HOME/.emacs.d/. This causes Emacs launching slower and slower. Normally, I don't turn-off my computer, nor I need to launch Emacs frequently. So it was not big deal.

However, sometimes it took more than 7 seconds to launch Emacs on my idle Gentoo machine. Why it took so much time before starting? Which file is the time-consuming monster? Unfortunately, I couldn't answer.

So I tried to clean-up my configurations. First, I remove all unused junk from my $HOME/.emacs.d/init.el. Even after that, my init file was too big to maintain easily. So I modulize the init file into separate code pieces. Now, there is just a small init.el, and lots of code pieces reside in $HOME/.emacs.d/init/. For example, my customization for dired package is stored in $HOME/.emacs.d/init/dired.el, and Korean language customization is stored in $HOME/.emacs.d/init/dired.el, and so on.

After that, I wrote simple macro, which loads the init code pieces from the specified directory, $HOME/.emacs.d/init/, if it was not loaded before, much like Emacs's require function.

(uinit/load "korean" enable-multibyte-characters)

You can interpret above code as "If enable-multibyte-characters is not nil, load $HOME/.emacs.d/init/korean.el, if it was not loaded before."

In detail, uinit/load function will search whether the given code piece (e.g. Korean.el in above) is loaded before, by searching the list uinit/loaded-init-files. Then if it does not exists, it call call load function to load the file. It also record the duration time of loading the piece, and save it in to the report buffer, "*uinit*".

After my init.el is loaded by Emacs, it will repeatedly call uinit/load to load lots of init code pieces, saving the timing of loading in "*uinit*" buffer. In the end of init.el file, I call uinit/summarize so that it will sort *uinit* buffer by the consumed time, and wrote the total amount of time to load my init code pieces. Here's the screenshot of *uinit* buffer:

nil

Usage

You can grab the source of uinit package from here.

Your emacs init script ($HOME/.emacs or $HOME/.emacs.d/init.el) modifies load-path, You must set load-path before using any of function in uinit because uinit will try to byte compile your init pieces on load. Otherwise byte compilation will fail. Especially, if you're using package, try to call (package-initialize) before loading uinit. For example:

(when (locate-library "package")
  (require 'package)
  (package-initialize))

(setq uinit/init-directory "~/.emacs.d/init")
(require 'uinit)

Then, you're free to call uinit/load to load your init code pieces. (Your init code pieces should be placed in uinit/init-directory.) For example:

(uinit/load "darwin" (eq system-type 'darwin))
(uinit/load "X" (eq window-system 'x))

Finally, add following code in the end of your init.

(uinit/summarize)

A dirty macro that convert the scala type to string type in binary representation

Seong-Kook Shin

2013-09-26 00:00

Comments

Handy function that convert the scala type to string type in C. This is a not-portable GCC macro to produce binary-represented string in C or C++.

Sometimes, it is useful to have a way to print a scala value in a binary representation. Unfortunately, there is no such format specifier for printf(3). Luckily, glibc has a customization feature for printf(3)-like functions. If you're interested in, read the code here. For example, you could write:

int i = 0xdeadbeef;
printf("%b\n", i);      /* print 11011110101011011011111011101111 */

If you're not using glibc, you could create some utility functions to do this. Here're some candidate function prototypes:

/* return the binary represented string of VALUE.  The width
 * of the string is CHAR_BIT * size.  The return value should
 * be free(3)ed after use. */
char *binstr(unsigned long long value, size_t size);

/* store the binary representation of VALUE in string pointed by BUF
 * with the size SIZE.  */
char *binstr(unsigned long long value, char *buf, size_t size);

I'm not satisfied with these. The first one, which returns a value that is dynamically allocated, looks heavy. Plus, the caller should supply the number of binary digit, since it always accept the value as unsigned long long. The second one, which works only when the user provided the buffer as in the second and the third parameters. It does not look heavy, but probably someone may be not happy with it, since the user always need to prepare the buffer.

I couldn't come up with a everybody-satisified solution with this.

Since most of my works are done in Linux and MacOS; both provides GCC, so I decided to drop the portability. :)

#define binstr(x)       ({ char *p = alloca(sizeof(x) * CHAR_BIT + 1);  \
      char *q = p + sizeof(x) * CHAR_BIT;                               \
      *q-- = '\0';                                                      \
      int i;                                                            \
      typeof(x) tmp = (x);                                              \
      for (i = 0; i < sizeof(x) * CHAR_BIT; i++) {                      \
	*q-- = (tmp & 1) ? '1' : '0';                                   \
	tmp >>= 1;                                                      \
      }                                                                 \
      q + 1;                                                            \
    })

I'm using alloca(3), instead of malloc(3), so that the memory buffer should be freed automatically. The caller no longer need to call free(3).
To decide the right number of digits of the value type, binstr is a macro function.
I used statement expression, ({…}) (GCC extension), which allows me to declare local variables.
I used typeof operator (GCC extension), which allows to declare a local variable, that has the same type as the parameter.

How I kill a process with suspicious TCP CLOSE_WAIT

Seong-Kook Shin

2013-07-10 00:00

Comments

During our server-side application development, we encontered a lot of connections are in CLOSE_WAIT state, so that our server process is out of file descriptors. We are in the middle of development of a client application that runs in the mobile androids, and the server-side application that runs in a cloud infrastrure.

I'm in the server-side team, and our team is focusing on the development of server-side. Our server-side have multiple front-end server that expose the interface for the clients. Front-end servers are like load-balancers, they dispatch the client requests to the one of the several back-end workers. Since we're in the middle of the development, our front-end servers and back-end servers have a couple of bugs in them. They sometimes made the server crash, even hang unpredictively.

Unfortunately, while we were tring to stablize our server applications, the client team needed a prototype server cluster, so that they can develop their client application and test the interaction between client and the front-end. Personally, I don't want to provide our prototype servers to the client team until the server-side is stablized, but the client team also need to hurry, to meet the dead-line, so we have no choice but to provide them still-unstable-servers.

The biggest problem was, the server application leaves CLOSE_WAIT state TCP connections on unexpected network situation. So, after a couple of hours, the server process ran out of file descriptors, denying client requests. Since we use sophiscated third-party network library, it would take some times to fix the problem.

So, I need some kind of watchdog, which periodically check whether the server process leaves CLOSE_WAIT connections, and kill them, leave some logs, and so on. Our server application is managed by init(1) like launcher, so when the server processes are terminated, the launcher automatically raise them.

Implementation

I was in hurry to implement this wachdog program, so I decided to write small bash script, but later changed to Ruby script. Fortunately, all of our servers already have Ruby 1.8 installed.

At some time slice, the output of the netstat(1) would like this:

$ netstat -ntp
...
tcp  0  0  10.149.8.221:46271  54.235.151.255:6379  ESTABLISHED 16125/fe-server
tcp  0  0  10.149.8.221:46283  54.235.151.255:6379  ESTABLISHED 16118/fe-server          
tcp  0  0  10.149.8.221:46267  54.235.151.255:6379  ESTABLISHED 16120/fe-server          
tcp  0  0  10.149.8.221:35250  10.158.95.68:58964   CLOSE_WAIT  16063/fe-server   
tcp  0  0  10.149.8.221:43557  10.147.191.96:52421  ESTABLISHED 16063/fe-server
tcp  0  0  10.149.8.221:8010   107.22.161.62:37126  CLOSE_WAIT  -
...
$ _

The netstat(1) from net-tools, accept -n option, indicates to use numerical addresses and ports, -t options indicates to show only TCP connections, and -p options to show the related PID and program names.

It looks trival to catch the PID of the process that has one or more CLOSE_WAIT connections. One thing to keep in mind is, netstat(1) sometimes displays "-" in the PID/PROGRAM field. I don't have enough time when netstat(1) shows "-", but fortunately, fuser(1) can identify the owner PID of the connection.

$ fuser -n tcp 8010
35250/tcp:           16063
$ fuser -n tcp 8010 2>/dev/null
 16063$_

My first implementation was, just simply count the number of CLOSE_WAIT connections per process, and kill(1) $PID if the process has more than N CLOSE_WAIT connections.

The limitation of the first implementation is, it may kill the process with CLOSE_WAIT connection that the process just about to close(2) it.

So the second implementation work like this:

save the connection information (source address:port, destination address:port) per process as a set-like container
Wait for certain amount of the time
save the connection information again, in another set-like container.
Get the intersection of the two set.
If the number of elements in the intersection exceeds N, kill the process.

I couldn't come up with a good implementation of set-like container in bash(1), so I re-implement from the scratch with ruby(1).

After few hours, another problem arised. Some server processes, goes coma, and does not adhere to SIGTERM. We can only kill them with SIGKILL, so I modified the killing line like this:

kill $pid; sleep 2; kill -0 $pid && kill -9 $pid

This line, first send SIGTERM to the $pid, then sleep for 2 seconds, and if it still can send a signal to the process (in other words, if the process is still alive), send SIGKILL to the $pid.

Usage

I named the script, resreap. The reason was, we call our server processes as resources, so it stands for 'RESource REAPer'. The full source code is available from here.

With some extra features, my script, called resreap, can accept following options:

$ ./resreap --help
Kill processes that have enough CLOSE_WAIT socket(s)
Usage: resreap [OPTION...]

    -f PAT        Kill only processes whose command matches PAT
    -F HOST:PORT  Ignore if foreign endpoint matches to HOST:PORT
		  HOST should be in IPv4 numerical notation.

    -l N          If a process has more than or equal to N CLOSE_WAIT
		  socket(s), it will be killed with a signal
		  (default: 2)

    -i N          Set sleep interval between checks in seconds
		  (default: 2)

    -c CMD        Before sending a signal, execute CMD in the shell,
		  If this CMD returns non-zero returns, the process
		  will not receive any signal.

    -s SIG        Set the signal name (e.g. TERM) that will be send
		  to a process (default: TERM)
    -w SEC        Set the waiting time in seconds between the signal and
		  SIGKILL (default: 2)

    -d            dry run, no kill
    -D            debug mode

    -h            show this poor help messages and exit
    -v            show version information and exit

Note that if a process receives the signal, and the process is alive
for 2 second(s), the process will receive SIGKILL.

If you are going to use "-f" option, I recommend to try "-d -D" option
first.  If you get the pid of the culprit process, try to get the
command name by "ps -p PID -o command=" where PID is the pid of that
process.

You could send two signal(s) before sending SIGKILL using '-S' option.
This can be useful since some JVM print stacktrace on SIGQUIT.

$ _

For example, if you want to kill a process if it has more than 2 CLOSE_WAIT connections, and you only care for java program, then you can do:

$ ./resreap -l 2 -f ^java

Plus, if you want to ignore CLOSE_WAIT connection on 127.0.0.1:2049, you could do:

$ ./resreap -F 127.0.0.1:2049

I really hope that we don't need to use this awful script for our servers. :)