Date: Mon, 9 Jul 2012 11:38:15 -0400
From: John Baldwin <jhb@freebsd.org>
To: freebsd-fs@freebsd.org
Cc: pho@freebsd.org, Konstantin Belousov <kib@freebsd.org>
Subject: Re: close() of an flock'd file is not atomic
Message-ID: <201207091138.15655.jhb@freebsd.org>
In-Reply-To: <201206060817.54684.jhb@freebsd.org>
References: <201203071318.08241.jhb@freebsd.org> <201203161406.27549.jhb@freebsd.org> <201206060817.54684.jhb@freebsd.org>
On Wednesday, June 06, 2012 8:17:54 am John Baldwin wrote:
> On Friday, March 16, 2012 2:06:27 pm John Baldwin wrote:
> > On Friday, March 09, 2012 10:59:29 am John Baldwin wrote:
> > > On Thursday, March 08, 2012 5:39:19 pm Konstantin Belousov wrote:
> > > > On Thu, Mar 08, 2012 at 03:39:07PM -0500, John Baldwin wrote:
> > > > > On Wednesday, March 07, 2012 1:18:07 pm John Baldwin wrote:
> > > > > > So I ran into this problem at work. Suppose you have a process that opens a
> > > > > > read-write file descriptor with O_EXLOCK (so it has an flock()). It then
> > > > > > writes out a binary into that file. Another process wants to execve() the
> > > > > > file when it is ready, so it opens the file with O_EXLOCK (or O_SHLOCK), and
> > > > > > will call execve() once it has locked the file. In theory, what should happen
> > > > > > is that the second process should wait until the first process has finished
> > > > > > and called close(). In practice what happens is that I occasionally see the
> > > > > > second process fail with ETXTBSY.
> > > > > >
> > > > > > The bug is that the vn_closefile() does the VOP_ADVLOCK() to unlock the file
> > > > > > separately from the call to vn_close() which drops the writecount. Thus, the
> > > > > > second process can do an open() and flock() of the file and subsequently call
> > > > > > execve() after the first process has done the VOP_ADVLOCK(), but before it
> > > > > > calls into vn_close(). In fact, since vn_close() requires a write lock on the
> > > > > > vnode, this turns out to not be too hard to reproduce at all. Below is a
> > > > > > simple test program that reproduces this constantly. To use, copy /bin/test
> > > > > > to some other file (e.g. /tmp/foo) and make it writable (chmod a+w), then run
> > > > > > ./flock_close_race /tmp/foo.
> > > > > >
> > > > > > The "fix" I came up with is to defer calling VOP_ADVLOCK() to release the lock
> > > > > > until after vn_close() executes. However, even with that fix applied, my test
> > > > > > case still fails. Now it is because open() with a given lock flag is
> > > > > > non-atomic in that the open(O_RDWR) will call vn_open() and bump v_writecount
> > > > > > before it blocks on the lock due to O_EXLOCK, so even though the 'exec_child'
> > > > > > process has the fd locked, the writecount can still be bumped. One gross hack
> > > > > > would be to defer the bump of the writecount to the caller of vn_open() if the
> > > > > > caller passes in O_EXLOCK or O_SHLOCK, but that's a really gross kludge, plus
> > > > > > it doesn't actually work. I ended up moving acquiring the lock into
> > > > > > vn_open_cred(). The current patch I'm testing has both of these approaches,
> > > > > > but the first one is #if 0'd out, and the second is #if 1'd.
> > > > > >
> > > > > > http://www.freebsd.org/~jhb/patches/flock_open_close.patch
> > > > >
> > > > > Based on some feedback from Konstantin, I've fixed some issues in the failure
> > > > > path handling for VOP_ADVLOCK(). I've also removed the #if 0'd code mentioned
> > > > > above, so the patch is now the actual change that I'm testing. So far it
> > > > > handles both my workload at work and my test program without any issues.
> > > >
> > > > I think a comment is needed for a reason to call vn_writechk() second time.
> > >
> > > Fixed.
> > >
> > > > Could you, please, point me, where the FHASLOCK is set for O_EXLOCK | O_SHLOCK
> > > > case in the patched kernel ?
> > >
> > > It wasn't. :( I wonder how this was even working since close shouldn't have
> > > been unlocking. I'll need to do some more testing. BTW, I ran into fhopen()
> > > and found that I would need to put all this same logic into that, so I've split
> > > the common code from fhopen() and vn_open_cred() into a new vn_open_vnode().
> > > I think in general it improves both sets of code.
> > >
> > > I'll update the patch once I've done some more testing.
>
> Based on feedback from Konstantin, I have split the vn_open_vnode() changes
> out into a separate patch. Once that patch is in the tree I will revisit
> this and update the actual bug-fix patch.
>
> The vn_open_vnode() patch is at
> http://www.freebsd.org/~jhb/patches/vn_open_vnode.patch
>
> I tested it by doing a buildworld -j 32 in a loop while NFS exporting the
> /usr/obj tree to another machine that did a continual find | xargs md5 loop
> over the /usr/obj tree. This survived overnight.

Here now is the tested version of the actual fix after the vn_open_vnode()
changes were committed. This is hopefully easier to parse now.

http://www.FreeBSD.org/~jhb/patches/flock_open_close4.patch

I'm enclosing an updated copy of the test program below:

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void
usage(void)
{

	fprintf(stderr, "Usage: flock_close_race <binary> [args]\n");
	exit(1);
}

/*
 * Writer: repeatedly open the binary read-write with an exclusive
 * lock and close it again, racing against exec_child().
 */
static void
child(const char *binary)
{
	int fd;

	/* Exit as soon as our parent exits. */
	while (getppid() != 1) {
		fd = open(binary, O_RDWR | O_EXLOCK);
		if (fd < 0) {
			/*
			 * This may get ETXTBSY since exit() will
			 * close its open fd's (thus releasing the
			 * lock), before it releases the vmspace (and
			 * mapping of the binary).
			 */
			if (errno == ETXTBSY)
				continue;
			err(1, "can't open %s", binary);
		}
		close(fd);
	}
	exit(0);
}

/*
 * Executor: take a shared lock on the binary, then exec it.  The
 * descriptor is not closed, so the lock is still held when execv()
 * runs.
 */
static void
exec_child(char **av)
{
	int fd;

	fd = open(av[0], O_RDONLY | O_SHLOCK);
	if (fd < 0)
		err(1, "can't open %s", av[0]);
	execv(av[0], av);
	err(127, "execv");
}

int
main(int ac, char **av)
{
	struct stat sb;
	pid_t pid;

	if (ac < 2)
		usage();
	if (stat(av[1], &sb) != 0)
		err(1, "stat(%s)", av[1]);
	if (!S_ISREG(sb.st_mode))
		errx(1, "%s not a regular file", av[1]);

	/* Start the writer that keeps locking and unlocking the binary. */
	pid = fork();
	if (pid < 0)
		err(1, "fork");
	if (pid == 0)
		child(av[1]);

	/* Exec the binary over and over; an ETXTBSY failure exposes the race. */
	for (;;) {
		pid = fork();
		if (pid < 0)
			err(1, "fork");
		if (pid == 0)
			exec_child(av + 1);
		wait(NULL);
	}
	return (0);
}

-- 
John Baldwin
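
As a rough userland illustration of the second race described in the quoted
message (open() bumping the write count before the lock is taken), the sketch
below splits a locking open into its two steps. open_then_lock() is a
hypothetical helper, not part of the patch or of libc, and it uses flock()
rather than the O_EXLOCK/O_SHLOCK flags only so that the ordering is explicit.

```c
#include <sys/file.h>	/* flock(), LOCK_SH, LOCK_EX, LOCK_NB */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (an assumption for illustration): a userland
 * analogue of open(path, flags | O_EXLOCK) or
 * open(path, flags | O_SHLOCK).  lock_op is LOCK_EX or LOCK_SH,
 * optionally or'ed with LOCK_NB.  flags must not contain O_CREAT.
 * Returns a locked descriptor, or -1 with errno set.
 */
int
open_then_lock(const char *path, int flags, int lock_op)
{
	int fd, saved_errno;

	assert(path != NULL);

	/* Step 1: open the file; with O_RDWR it is now open for writing. */
	fd = open(path, flags);
	if (fd < 0)
		return (-1);

	/* Step 2: only now request the lock; without LOCK_NB this may block. */
	if (flock(fd, lock_op) != 0) {
		saved_errno = errno;
		close(fd);
		errno = saved_errno;
		return (-1);
	}
	return (fd);
}
```

Between the two steps a descriptor that is open for writing already exists
even though the caller does not yet hold the lock, which is the window a
concurrent execve() can hit.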