Date: Tue, 19 Mar 1996 19:10:39 -0700 (MST)
From: Terry Lambert <terry@lambert.org>
To: msmith@atrad.adelaide.edu.au (Michael Smith)
Cc: questions@freebsd.org
Subject: Re: Diskless FreeBSD
Message-ID: <199603200210.TAA25640@phaeton.artisoft.com>
In-Reply-To: <199603192342.KAA02833@genesis.atrad.adelaide.edu.au> from "Michael Smith" at Mar 20, 96 10:12:36 am

> > > I don't know _why_, I suspect that the NFS swap code doesn't/won't/can't
> > > extend the file, but I haven't been bothered enough by it to try to find out.
> >
> > How can an NFS client know whether the server is zero-filling the
> > pages or not?
>
> Huh?  The NFS swapfile, in the eyes of the NFS server, is just a file that
> some client is scribbling all over.  It's nothing special.  The problem is
> just that if the file isn't as big as the client expects it to be, for
> some reason the client dies.
>
> Zero-filling here is a total non sequitur.

If you make the swap file a certain size, unless you actually write
every FS block between the start and the end of the file, it will be
a sparse file: it won't take up the space it says it does.

That is, it will have allocated only the necessary intermediate
blocks, an inode, and one terminal block (which you cause to be
allocated by writing it).

Pages "read" from unallocated areas in the file will reference '0'
block pointers, which will cause the pages to be created on demand.
That is, the file will be a sparse file on the server.

There is no easy way for an NFS client to know if the file is sparse
or not.  So that means it can't care if the server is zero-filling
the file.  That is, it can't care whether the disk space has really
been allocated or not, unlike with a local swap file.

Local swap files care, because they can't reenter to page-fill.

This has *NOTHING* to do with the fact that the file *MUST* be the
right size in the current code, even if there aren't blocks allocated
to it.

> > Typically, I'd say it can't extend the file because, like mmap, the
> > vnode/extent used for cache mapping images (even for swap files)
> > references the length from the mapping structure, not from the in
> > core vnode.
>
> In other words, the kernel is told (via the config file) that it has a 20M
> swap arena, but the map to the swapfile to back that arena is flawed because
> the extent for the file is constrained to the size of the file itself.
> Makes sense I guess.

No.  It assumes that the file you give it is the size you give it in
the config file, even if it's really larger or smaller.

The problem is that the page size is not greater than or equal to
the host file system block size.

As a result, when you go to swap out to page 1024 (4M+0) of a file
that has no real space allocated to it, the file system must perform
a partial block write to write the page.

Because the bitmaps on the cache are not used the way the header
file implies that they are, a partial block write of a page causes
the block containing the page to be read *IF* the write area is not
on an FS block alignment boundary *OR* if the write area is smaller
than the FS block size.

This has to happen so that if the file contained *real* data in
the data block, the part that isn't being written keeps the same
data instead of being filled with zeros.

Now swapping will *always* do its writes on page boundaries.  The
screwup occurs because the FS block size is larger than one page.

Because there must be a read-before-write for the partial block to
be written, and the local NFS client's buffer cache must be read
into for the partial FS block write, the NFS client issues a read
to the remote host.

This read is past the end of the swap file, and therefore fails.

The NFS read failure for the full block past the end of the file
causes the write to not be attempted, and the file does not get
self-extended.

The resulting swap write failure panics the client machine.

To fix this "problem" (i.e., you want to do swap "overcommit" on your
NFS server without hacking bootp, which isn't a very good thing to
do), you will need to:

1)	Make the FS block size on the server 4k, the same as the page
	size.
	In addition, you must make the transfer size for the NFS
	read/write on the server for the swapfile 4k.

-OR-

2)	You will need to make the 8 bit fractional buffer bitmap
	work as expected, so that partial block writes do not
	require block reads to allow them to complete (assuming
	they are on alignment boundaries).

Solution #2 above will also drastically increase FreeBSD's throughput
on aligned random writes of an existing file, since it will reduce
the number of device blocks which must be written for partial block
writes.

Aligned writes of size 512, 1024, 2048, and 4096 bytes are used in
the Ziff-Davis "WinBench 95" benchmark suite.  They also do aligned
200 byte writes, which will cause one 512 byte disk block to be
read/written per I/O (assuming no cache hit), or two in the case of
the 200 byte record overlaying a block boundary.

Sequential writes weenie out by getting a cache hit, unless they
are doing cache-busting (in which case, this would help those
benchmarks, too).

Neither of these guarantees that a write of an object not requiring
a read will in fact result in the object not being read; there used
to be an issue, even with fs_bsize-sized writes on alignment
boundaries.  I know something like this was fixed in the last month
or so by John Dyson, but I'm not sure if this was exactly it, or if
it was just a similar situation with some other cache interaction.

In any case, if you avoid the NFS read, then the NFS write will
extend the file, as expected, instead of causing the thing to crash.
But even if you get all your block sizes lined up, you may still not
be able to avoid the read-before-write, and avoiding it is the only
way to avoid the NFS read past the EOF.

I would suggest, instead, starting with no swap, and adding the swap
after creating it on the remote system in the client's /etc/rc file.
This is not as "clean" as growing swap as necessary, but I think the
zero-fill of a sparse file will allow you to create the swap as
sparse and fool the client.
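
As a rough illustration of creating the swap file sparse on the
server, here is a minimal userland sketch in C.  The function name
make_sparse_swap is hypothetical and not part of any FreeBSD tool; it
only seeks to the last byte of the file and writes it, which is what
causes the single terminal block to be allocated.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create `path` as a sparse file of `size` bytes.
 * Seeking to the last byte and writing it allocates only the terminal
 * block (plus whatever intermediate blocks the FS needs); everything
 * before it is a hole that reads back as zeros.
 * Returns 0 on success, -1 on error.
 */
int make_sparse_swap(const char *path, off_t size)
{
    char zero = 0;
    int fd;

    if (size <= 0)
        return -1;

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* Position on the final byte and write it to set the file length. */
    if (lseek(fd, size - 1, SEEK_SET) == (off_t)-1 ||
        write(fd, &zero, 1) != 1) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Whether the holes actually stay unallocated depends on the server's
file system, so treat the space saving as an assumption to check.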

This still doesn't solve the crash when the NFS client writes and
there aren't any real blocks to convert the sparse file into a
non-sparse file.  Oh well; to do that, you'd need to get into deeper
detail on the vnode pager itself.

This is doubly difficult because FreeBSD *likes* to use up all its
swap if it can to keep extra pages quickly accessible... so it will
probably quickly turn small files into large files if it can (you'd
have to modify the page replacement policy to make it act otherwise).

I have no idea in hell how you would safely decide that the swap was
no longer in use so that it could be reclaimed (I assume each client
will need its own swap, and if you can't decide which clients are
active, you can't make the file sparse again until the next login of
the client that created the thing.  Unless you have some kind of
rc.shutdown, and can be guaranteed that the clients will call it...).

Personally, I'd commit the files in the rc.local and start a *long*
UDP "keep-alive" to a daemon that likes to delete "dead" swap files;
it seems to be the best bet.  This is only if you convinced me to
overcommit my NFS server's disk space as swap for too many clients
in the first place (a hard sell, to say the least).

One possible intermediate fix, if the read-before-write of block
sized objects still occurs for an aligned write, is to hack the vnode
pager to know it's on an NFS client (you can get the flag that it's
remote from the swapfile vnode's pointer to the fs structure), and
if the write would be past the end of the file, extend the file
automatically by doing a one byte write up to the next FS block
boundary past the area you are interested in.  You could do this
safely because the area you are writing is guaranteed to not contain
good swap data: if it did, you wouldn't be writing past the EOF.

You may have to update the client's idea of the file size
promiscuously in the local attribute cache if you do this.
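
The arithmetic behind that hack can be sketched in userland C.  This
is not vnode pager code, and the function names below are made up for
illustration; the sketch only shows when a write is a partial block
write in the sense described earlier, and which byte would have to be
written to push the EOF out to the next FS block boundary.

```c
#include <stdint.h>

/*
 * Nonzero if writing `len` bytes at offset `off` is a partial block
 * write: the write does not start on an FS block boundary, or it is
 * smaller than the FS block size.  This is the case that triggers the
 * read-before-write.
 */
int needs_read_before_write(uint64_t off, uint64_t len, uint64_t bsize)
{
    return (off % bsize) != 0 || len < bsize;
}

/*
 * Offset of the single byte to write so that the file ends exactly on
 * the first FS block boundary at or past the end of [off, off + len).
 * Once that byte exists, a read of the enclosing block is no longer
 * past the EOF.
 */
uint64_t extend_byte_offset(uint64_t off, uint64_t len, uint64_t bsize)
{
    uint64_t end = off + len;
    uint64_t boundary = ((end + bsize - 1) / bsize) * bsize;

    return boundary - 1;
}
```

With an assumed 4k page and 8k FS block, swapping out page 1024 (offset
4M) is a partial block write, and the byte to write sits at offset
4M + 8191.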

Again, it won't buy you much because you don't know when it's OK to
destroy the swap files.


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.