From owner-freebsd-fs@FreeBSD.ORG  Wed Mar  7 19:07:11 2012
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 82D2E106564A;
	Wed,  7 Mar 2012 19:07:11 +0000 (UTC) (envelope-from jhb@freebsd.org)
Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42])
	by mx1.freebsd.org (Postfix) with ESMTP id 42B5F8FC19;
	Wed,  7 Mar 2012 19:07:11 +0000 (UTC)
Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [96.47.65.170])
	by cyrus.watson.org (Postfix) with ESMTPSA id CE7A846B8E;
	Wed,  7 Mar 2012 14:07:05 -0500 (EST)
Received: from jhbbsd.localnet (unknown [209.249.190.124])
	by bigwig.baldwin.cx (Postfix) with ESMTPSA id 400F0B911;
	Wed,  7 Mar 2012 14:07:05 -0500 (EST)
From: John Baldwin <jhb@freebsd.org>
To: fs@freebsd.org
Date: Wed, 7 Mar 2012 13:18:07 -0500
User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110714-p10; KDE/4.5.5; amd64; ; )
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="us-ascii"
Content-Transfer-Encoding: 7bit
Message-Id: <201203071318.08241.jhb@freebsd.org>
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7
	(bigwig.baldwin.cx); Wed, 07 Mar 2012 14:07:05 -0500 (EST)
Cc: pho@freebsd.org, kib@freebsd.org
Subject: close() of an flock'd file is not atomic
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 07 Mar 2012 19:07:11 -0000

So I ran into this problem at work.  Suppose you have a process that opens a 
read-write file descriptor with O_EXLOCK (so it has an flock()).  It then 
writes out a binary into that file.  Another process wants to execve() the 
file when it is ready, so it opens the file with O_EXLOCK (or O_SHLOCK), and 
will call execve() once it has locked the file.  In theory, what should happen 
is that the second process should wait until the first process has finished 
and called close().  In practice what happens is that I occasionally see the 
second process fail with ETXTBUSY.

The bug is that the vn_closefile() does the VOP_ADVLOCK() to unlock the file 
separately from the call to vn_close() which drops the writecount.  Thus, the 
second process can do an open() and flock() of the file and subsequently call
execve() after the first process has done the VOP_ADVLOCK(), but before it 
calls into vn_close().  In fact, since vn_close() requires a write lock on the 
vnode, this turns out to not be too hard to reproduce at all.  Below is a 
simple test program that reproduces this constantly.  To use, copy /bin/test 
to some other file (e.g. /tmp/foo) and make it writable (chmod a+w), then run 
./flock_close_race /tmp/foo.

The "fix" I came up with is to defer calling VOP_ADVLOCK() to release the lock 
until after vn_close() executes.  However, even with that fix applied, my test
case still fails.  Now it is because open() with a given lock flag is
non-atomic in that the open(O_RDWR) will call vn_open() and bump v_writecount
before it blocks on the lock due to O_EXLOCK, so even though the 'exec_child' 
process has the fd locked, the writecount can still be bumped.  One gross hack
would be to defer the bump of the writecount to the caller of vn_open() if the
caller passes in O_EXLOCK or O_SHLOCK, but that's a really gross kludge, plus
it doesn't actually work.  I ended up moving acquiring the lock into 
vn_open_cred().  The current patch I'm testing has both of these approaches,
but the first one is #if 0'd out, and the second is #if 1'd.

http://www.freebsd.org/~jhb/patches/flock_open_close.patch

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void
usage(void)
{
	fprintf(stderr, "Usage: flock_close_race <binary> [args]\n");
	exit(1);
}

static void
child(const char *binary)
{
	int fd;

	/* Exit as soon as our parent exits. */
	while (getppid() != 1) {
		fd = open(binary, O_RDWR | O_EXLOCK);
		if (fd < 0)
			err(1, "can't open %s", binary);
		close(fd);
	}
	exit(0);
}

static void
exec_child(char **av)
{
	int fd;

	fd = open(av[0], O_RDONLY | O_SHLOCK);
	execv(av[0], av);
	err(127, "execv");
}

int
main(int ac, char **av)
{
	struct stat sb;
	pid_t pid;

	if (ac < 2)
		usage();
	if (stat(av[1], &sb) != 0)
		err(1, "stat(%s)", av[1]);
	if (!S_ISREG(sb.st_mode))
		errx(1, "%s not an executable", av[1]);

	pid = fork();
	if (pid < 0)
		err(1, "fork");
	if (pid == 0)
		child(av[1]);
	for (;;) {
		pid = fork();
		if (pid < 0)
			err(1, "vfork");
		if (pid == 0)
			exec_child(av + 1);
		wait(NULL);
	}
	return (0);
}


-- 
John Baldwin