From owner-freebsd-fs@FreeBSD.ORG Wed Mar 7 19:07:11 2012 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 82D2E106564A; Wed, 7 Mar 2012 19:07:11 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42]) by mx1.freebsd.org (Postfix) with ESMTP id 42B5F8FC19; Wed, 7 Mar 2012 19:07:11 +0000 (UTC) Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [96.47.65.170]) by cyrus.watson.org (Postfix) with ESMTPSA id CE7A846B8E; Wed, 7 Mar 2012 14:07:05 -0500 (EST) Received: from jhbbsd.localnet (unknown [209.249.190.124]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 400F0B911; Wed, 7 Mar 2012 14:07:05 -0500 (EST) From: John Baldwin To: fs@freebsd.org Date: Wed, 7 Mar 2012 13:18:07 -0500 User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110714-p10; KDE/4.5.5; amd64; ; ) MIME-Version: 1.0 Content-Type: Text/Plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Message-Id: <201203071318.08241.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Wed, 07 Mar 2012 14:07:05 -0500 (EST) Cc: pho@freebsd.org, kib@freebsd.org Subject: close() of an flock'd file is not atomic X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 07 Mar 2012 19:07:11 -0000 So I ran into this problem at work. Suppose you have a process that opens a read-write file descriptor with O_EXLOCK (so it has an flock()). It then writes out a binary into that file. Another process wants to execve() the file when it is ready, so it opens the file with O_EXLOCK (or O_SHLOCK), and will call execve() once it has locked the file. In theory, what should happen is that the second process should wait until the first process has finished and called close(). In practice what happens is that I occasionally see the second process fail with ETXTBUSY. The bug is that the vn_closefile() does the VOP_ADVLOCK() to unlock the file separately from the call to vn_close() which drops the writecount. Thus, the second process can do an open() and flock() of the file and subsequently call execve() after the first process has done the VOP_ADVLOCK(), but before it calls into vn_close(). In fact, since vn_close() requires a write lock on the vnode, this turns out to not be too hard to reproduce at all. Below is a simple test program that reproduces this constantly. To use, copy /bin/test to some other file (e.g. /tmp/foo) and make it writable (chmod a+w), then run ./flock_close_race /tmp/foo. The "fix" I came up with is to defer calling VOP_ADVLOCK() to release the lock until after vn_close() executes. However, even with that fix applied, my test case still fails. Now it is because open() with a given lock flag is non-atomic in that the open(O_RDWR) will call vn_open() and bump v_writecount before it blocks on the lock due to O_EXLOCK, so even though the 'exec_child' process has the fd locked, the writecount can still be bumped. One gross hack would be to defer the bump of the writecount to the caller of vn_open() if the caller passes in O_EXLOCK or O_SHLOCK, but that's a really gross kludge, plus it doesn't actually work. I ended up moving acquiring the lock into vn_open_cred(). The current patch I'm testing has both of these approaches, but the first one is #if 0'd out, and the second is #if 1'd. http://www.freebsd.org/~jhb/patches/flock_open_close.patch #include #include #include #include #include #include #include #include #include static void usage(void) { fprintf(stderr, "Usage: flock_close_race [args]\n"); exit(1); } static void child(const char *binary) { int fd; /* Exit as soon as our parent exits. */ while (getppid() != 1) { fd = open(binary, O_RDWR | O_EXLOCK); if (fd < 0) err(1, "can't open %s", binary); close(fd); } exit(0); } static void exec_child(char **av) { int fd; fd = open(av[0], O_RDONLY | O_SHLOCK); execv(av[0], av); err(127, "execv"); } int main(int ac, char **av) { struct stat sb; pid_t pid; if (ac < 2) usage(); if (stat(av[1], &sb) != 0) err(1, "stat(%s)", av[1]); if (!S_ISREG(sb.st_mode)) errx(1, "%s not an executable", av[1]); pid = fork(); if (pid < 0) err(1, "fork"); if (pid == 0) child(av[1]); for (;;) { pid = fork(); if (pid < 0) err(1, "vfork"); if (pid == 0) exec_child(av + 1); wait(NULL); } return (0); } -- John Baldwin