Date: Wed, 9 Apr 1997 14:19:13 +0200 (MET DST)
From: frank@wins.uva.nl (Frank van der Linden)
To: freebsd-hackers@freebsd.org, tech-kern@netbsd.org
Subject: NFSv3 cookie jar
Message-ID: <199704091219.OAA03150@bsd.wins.uva.nl>
The following problem has been lying around for too long, so I'd like to finally solve it. I first stumbled on it over a year ago, when testing the NFSv3 code integrated into NetBSD against Solaris 2.5, and have had several other run-ins with it since. I saw that FreeBSD also encountered the problem sometime last month, so I'll send this to freebsd-hackers as well as tech-kern, to avoid Yet Another Duplicate Effort.

In NFSv2, directory offsets were specified as 32 bits, and they were real offsets, i.e. they could be interpreted as numbers. NFSv3 changed a couple of things. First of all, all offsets became 64 bits. For some reason, directory offsets also became opaque: they can no longer be interpreted as numbers. In practice this will probably still be possible most of the time, but you can't really take chances; it's the spec that says you can't do it, after all.

Also, NFSv3 introduced cookie verifiers: opaque entities returned by the server after each directory operation, to be passed along by the client in subsequent operations. These verifiers can be used by the server to see whether the directory has been modified in the meantime. Invalid cookies can be detected that way.

Now, this all sounds like an improvement. However, it leaves some problems for people implementing NFSv3. For example: what are the criteria for the server to return a 'bad cookie' error, i.e. what constitutes a change in a directory such that old offsets are now invalid? More seriously: what should the client do when the server returns a 'bad cookie' error?

Rick Macklem's code uses the filerev field from the vnode attributes to check whether a directory has been modified. On the other hand, Solaris 2.5 doesn't do any checks at all. Their server always returns a 0 cookie.

The problems start to appear. The BSD filerev check turned out to be too strict for Solaris 2.5 clients, or at least for the user on the Solaris 2.5 client.
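To make the strict server-side policy concrete, here is a minimal sketch of a filerev-style verifier check. It is not the actual BSD server code: `struct dir_state` and `check_cookieverf()` are hypothetical names, and the verifier is assumed to be nothing more than a copy of the directory's revision counter. The error value and verifier size are the ones given in RFC 1813.

```c
#include <stdint.h>
#include <string.h>

#define NFS3_OK              0
#define NFS3ERR_BAD_COOKIE   10003   /* value from RFC 1813 */
#define NFS3_COOKIEVERFSIZE  8       /* size from RFC 1813 */

/* Hypothetical stand-in for the per-directory state a server consults. */
struct dir_state {
    uint64_t filerev;   /* assumed to change on every directory update */
};

/*
 * Strict policy: any change to the directory since the verifier was
 * handed out invalidates the client's cookie.
 */
static int
check_cookieverf(const struct dir_state *dir, uint64_t cookie,
    const unsigned char verf[NFS3_COOKIEVERFSIZE])
{
    unsigned char current[NFS3_COOKIEVERFSIZE];

    /* A zero cookie starts a fresh scan, so there is nothing to compare. */
    if (cookie == 0)
        return NFS3_OK;

    /* Assumption: the verifier is just the raw bytes of filerev. */
    memcpy(current, &dir->filerev, sizeof(current));
    if (memcmp(current, verf, sizeof(current)) != 0)
        return NFS3ERR_BAD_COOKIE;
    return NFS3_OK;
}
```

With this policy a single create or remove between two READDIR calls is enough to make the second call fail, which is the strictness discussed here.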
Solaris has adopted the policy that a 'bad cookie' error is passed up to getdents() as an error (EINVAL). I guess it's one way to go. However, programs bail out because of this at weird moments. Who expects getdents() to return EINVAL because of this? I 'fixed' this about a year ago by removing the filerev check from the BSD server code.

I'm not saying that Solaris' approach of passing the error on to userlevel is that bad. After all, what is a good method of recovering? The BSD code simply re-reads all of the directory blocks until it hits the right offset again whenever it gets NFSERR_BAD_COOKIE. However, suppose you have a directory of 3 blocks. You read the first block. Your offset is now at the end of the first block. You delete all the files in the first block. You want to read the 2nd block. You get BAD_COOKIE. So then you start again from the beginning, until you are at the wanted offset. However, the first block has disappeared now, so your offset lands you at what was originally the 3rd block. You've missed the 2nd block entirely.

The best way to solve this is probably to only use userland code that doesn't mix create/remove/rename operations with getdirentries/readdir operations. Things are actually mostly OK for the standard BSD utilities, because they use fts(3), which reads in the entire directory before doing anything. Another way to go would be to have opendir() read in the entire directory, so that other applications using that interface would also be safe. Other systems that you may have as clients might still fail, but for them, all that you can do is take out the BAD_COOKIE check entirely. A problem will be emulated binaries, such as SVR4 binaries, that will do reads in 1048 byte chunks, mixed with dir operations.

Yet another possibility would be to do read-aheads in the NFS bio code whenever a directory is read at offset 0, pulling in the whole dir (within reasonable limits). OK, that's just a thought.
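The three-block scenario above can be replayed with a toy model. This is only an illustration, not kernel code: `struct toy_dir`, `toy_remove_block()` and `toy_recover_next()` are made-up names, each directory block is reduced to a single label, and the directory is assumed to compact when a block empties.

```c
#include <stddef.h>
#include <string.h>

#define MAXBLOCKS 8

/* Toy directory: each "block" is represented by a single label. */
struct toy_dir {
    const char *block[MAXBLOCKS];
    size_t nblocks;
};

/* Drop block idx; later blocks shift down (assumed compaction). */
static void
toy_remove_block(struct toy_dir *d, size_t idx)
{
    memmove(&d->block[idx], &d->block[idx + 1],
        (d->nblocks - idx - 1) * sizeof(d->block[0]));
    d->nblocks--;
}

/*
 * Recovery after a bad-cookie error: re-read from the start and skip
 * as many blocks as the client had already consumed. Returns the label
 * of the block the client reads next, or NULL at end of directory.
 */
static const char *
toy_recover_next(const struct toy_dir *d, size_t blocks_seen)
{
    return blocks_seen < d->nblocks ? d->block[blocks_seen] : NULL;
}
```

Starting from blocks "first", "second", "third" with one block already read, removing block 0 and then calling `toy_recover_next(&d, 1)` yields "third": the client never sees "second".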
Another issue is how to deal with the 64 bit cookies in the BSD code. There's no 64 bit field in the buffer struct to store them. What Rick did (I assume to minimize the changes to the rest of the kernel) is to maintain a mapping between offsets and cookies per nfs node, for VDIR nodes at least. This information is invalidated whenever an NFS dir buffer is invalidated. The problem with the code implementing this is that it can't distinguish between EOF and a bad cookie. This can have some unexpected results, e.g. the layer above thinks EOF has been reached, when it was in fact the result of invalidated cookie info, with the result that you end up missing some files in the directory (a whole block). I disabled nfs_invaldir to prevent this, letting the server take care of signalling bad cookies. This has the (much less frequent) effect of sometimes seeing duplicate files, so it's not a great solution either. I know that FreeBSD's current code does distinguish between EOF and a bad cookie, but while this is a fix, it is still prone to the error of losing a block described above.

All in all, I've come to the conclusion that patching directory(3) to always read the whole directory might be the best thing to do. For emulated binaries, well.. the emulation could try a read-ahead (for in-kernel emulations, that is), but it may be impossible to get completely right.

Basically, this means:

1) Get rid of nfs_invaldir and the separate cookie lists.
2) Be able to store a 64bit quantity in a struct buf. This would mean either an extra field, or making daddr_t 64 bits wide.
3) Pass the offset cookies up unmodified (interface change to VOP_READDIR: u_long * -> off_t *). This change could probably be avoided, but it would be very inconsistent not to do so.
4) The last argument to getdirentries(2) becomes an off_t *, not a long *, so that the 64bit offset cookies can be used by directory(3) functions.
5) The kernel will make no attempt to recover from a BAD_COOKIE error, and will just make getdirentries(2) return EINVAL.
6) opendir(3) will read in all of the directory if it sees that the directory is on NFS. This is already done for union directories, so it's a small change. opendir(3) should restart the operation of reading the whole directory if it gets EINVAL (i.e. the directory was modified while it was reading), to make sure a consistent view of the directory is obtained.

I might have missed some details, so please tell me if I did. If not, I'd like to do an experimental implementation of this soon and test it; the changes aren't that big.

- Frank
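For illustration, the read-everything-and-restart behaviour of point 6 could be prototyped in userland roughly as follows. This is a sketch under assumptions, not a patch to opendir(3): `snapshot_dir()`, `free_names()` and the retry limit are invented for the example, it only uses the standard opendir/readdir/closedir calls, and it assumes a bad cookie shows up as readdir() failing with EINVAL.

```c
#include <dirent.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define SNAPSHOT_MAX_RETRIES 5   /* arbitrary limit, an assumption */

/* Release an array of names built by snapshot_dir(). */
static void
free_names(char **names, size_t n)
{
    while (n > 0)
        free(names[--n]);
    free(names);
}

/*
 * Read every entry of path into a malloc'd array before handing anything
 * to the caller. If reading fails with EINVAL (directory changed while
 * we were reading), discard the partial result and start over.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int
snapshot_dir(const char *path, char ***out, size_t *count)
{
    for (int attempt = 0; attempt < SNAPSHOT_MAX_RETRIES; attempt++) {
        DIR *dp = opendir(path);
        char **names = NULL;
        size_t n = 0, cap = 0;
        int err = 0;

        if (dp == NULL)
            return -1;
        for (;;) {
            struct dirent *de;
            size_t len;

            errno = 0;                      /* tell EOF from error */
            if ((de = readdir(dp)) == NULL) {
                err = errno;
                break;
            }
            if (n == cap) {                 /* grow the array */
                size_t ncap = cap ? cap * 2 : 16;
                char **tmp = realloc(names, ncap * sizeof(*tmp));
                if (tmp == NULL) { err = ENOMEM; break; }
                names = tmp;
                cap = ncap;
            }
            len = strlen(de->d_name) + 1;
            if ((names[n] = malloc(len)) == NULL) { err = ENOMEM; break; }
            memcpy(names[n++], de->d_name, len);
        }
        closedir(dp);
        if (err == 0) {
            *out = names;
            *count = n;
            return 0;
        }
        free_names(names, n);
        if (err != EINVAL) {                /* only EINVAL is retried */
            errno = err;
            return -1;
        }
    }
    errno = EINVAL;
    return -1;
}
```

A caller would invoke `snapshot_dir(".", &names, &n)`, walk `names[0..n-1]`, and release the result with `free_names(names, n)`. Bounding the number of restarts is a design choice of the sketch, so that a directory under constant modification cannot keep the caller looping forever.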