Date: Tue, 12 Nov 2002 10:39:21 -0800
From: Terry Lambert <tlambert2@mindspring.com>
To: Tomas Pluskal <plusik@pohoda.cz>
Cc: freebsd-fs@freebsd.org
Subject: Re: seeking help to rewrite the msdos filesystem
Message-ID: <3DD14AD9.DF8D3580@mindspring.com>
References: <20021112134213.P32524-100000@localhost>
Tomas Pluskal wrote:
> > No; mostly it's non-page aligned accesses, and the fact that
> > serial access requires traversal of the entire file up to that
> > point, and so becomes exponential, the further you get into
> > the file.
>
> I believe that non-page aligned accesses would not make it 10 times
> slower, so main problem lies in the traversal of the file ?

You can lose a factor of 3 on this; basically, any 1K block you have
will end up faulting the next page to get the last 512b of the last
sector that starts at 512b from the end of the previous page (with 4K
pages and 1K blocks misaligned by 512b, one block in four straddles a
page boundary).  The issue here is one of doubled latency, every 4th
block in the FS.

In almost all optimization situations, after dealing with the obvious
(e.g. repetition of operations that should not need to be repeated
interior to loops, etc.), optimization will boil down to an issue of
reducing latency on data propagation.

> Is it a problem even if I am reading the file from beginning to end ?

Yes.

> Why?

Sequential access has a specific optimization available, which is to
use the last block to find the next block, but this fact is not
cached.

You could specifically attempt to speed up sequential access by doing
a cache-behind: when asking for a specific offset, knowing that the
last block was at offset N, and knowing its contents are pointed to
at XXX, you could go directly to the in-core block when asked for
offset N+1, instead of linearly re-traversing (a sketch of this
appears further down).  It's possible.

The problem with doing this is: (1) it's not there *now*, and (2)
there is no guarantee that XXX is still in core.  Cached data is hung
off the vnode, not off the inode, and the only safe place to do the
caching is in the inode, but there is no invalidation notification
below the level of the vnode, which is the caching object.

In other words, the metadata is cached off the vnode of the mount
point, not off the in-core image of the inode of the file you are
reading from, and even if it were pointed to off the inode, it would
be invalidated, LRU, off the vnode of the mount point, so it could
disappear out from under the inode that's pointing to it.

Hence the suggestion that metadata be explicitly cached, and thus
explicitly referenced: the file allocation table area of interest is
paged in from an in-core index.

> > http://www.usenix.org/publications/library/proceedings/sf94/forin.html
>
> I'll read this through.

Be aware that, like I said, the conclusions are a bit farcical and
contrived.  For all that, it presents a number of techniques that
will be useful for speeding up FAT access by file offset.

One of the issues here is that you will need "last close"
notification, and you are probably also interested in data caching
that has a preference for "open files, largest to smallest, within an
LRU consistency window"; this is really not counter-intuitive, if you
think about a shorter file with 5 opens on it or lots of random
access, vs. a longer file with less frequent access, both of which
are better cached than a small file with a lot of access
(effectively, this is a Laffer curve).

Probably the best approach, if you have a limited cache, is to
virtualize by offset in a btree, and over a certain depth, virtualize
and retraverse with a lower-persistence cache for the remainder of
the data.  So, say, if you btree'ed an index for a file, maybe the
hard index would stop at every 4th block, or every 8th block.

This would be enough to divide the offset of a new request by 8, then
traverse 3 blocks to get to the data, rather than traversing 2^N+3
blocks for a btree depth of N.

[This would be easier with a whiteboard, but I can draw an "ASCII
art" picture, if necessary]
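Short of a whiteboard, here is a minimal user-space sketch of that
hard index, under the assumption that the whole FAT has been copied
into an in-core array; every name in it (chain_index, STRIDE, and so
on) is invented for illustration, so treat it as a toy, not as the
existing msdosfs code:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NCLUSTERS 1024
#define FAT_EOC   0x0fffffffu       /* hypothetical end-of-chain marker */
#define STRIDE    8                 /* "hard index" every 8th cluster */

static uint32_t fat[NCLUSTERS];     /* stands in for an in-core FAT copy */

struct chain_index {
    uint32_t *slot;                 /* slot[k] = cluster at file index k*STRIDE */
    uint32_t nslots;
};

/* Walk the chain once (e.g. at open time), recording every STRIDE'th
 * cluster of the file. */
static void
chain_index_build(struct chain_index *ci, uint32_t first, uint32_t nclust)
{
    uint32_t cl = first;

    ci->nslots = (nclust + STRIDE - 1) / STRIDE;
    ci->slot = malloc(ci->nslots * sizeof(uint32_t));
    for (uint32_t i = 0; i < nclust && cl != FAT_EOC; i++) {
        if (i % STRIDE == 0)
            ci->slot[i / STRIDE] = cl;
        cl = fat[cl];
    }
}

/* Lookup: one divide to pick a slot, then at most STRIDE-1 chain
 * links, instead of 'index' links from the head of the chain. */
static uint32_t
chain_index_lookup(const struct chain_index *ci, uint32_t index)
{
    uint32_t cl = ci->slot[index / STRIDE];

    for (uint32_t i = index % STRIDE; i > 0; i--)
        cl = fat[cl];
    return (cl);
}

int
main(void)
{
    struct chain_index ci;

    /* Toy chain: cluster 2 -> 3 -> ... -> 100 (99 clusters). */
    for (uint32_t c = 2; c < 100; c++)
        fat[c] = c + 1;
    fat[100] = FAT_EOC;

    chain_index_build(&ci, 2, 99);
    /* File-relative cluster 42 resolves to disk cluster 44. */
    printf("index 42 -> cluster %u\n", chain_index_lookup(&ci, 42));
    free(ci.slot);
    return (0);
}

The cost is one full chain walk per open to build the slots, plus
four bytes per eight clusters of file; past whatever depth you cap
the hard index at, you would retraverse through the lower-persistence
cache, as described above.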
> > However, it does have a couple of good suggestions for speeding
> > FAT file access up.  The most useful is that you should cache
> > the metadata information, so that you can traverse the list of
> > FS blocks in memory, instead of on disk, and caching of all the
> > directory information (for the same reasons).  Because the
> > directory entry is also the inode, in FAT, this has the effect of
> > caching all the inodes, as well.
>
> Is this metadata caching the main difference of FreeBSD from
> linux/windows, that makes it so slow ?

Windows certainly caches metadata.  Windows also has other FS
attributes that are rather odd, and that improve FS performance at
the expense of other system performance.

One specific example is that memory under Windows is obtained from a
module called VMM32.VXD; everything which obtains memory from this
module is required to register a "low memory condition" callback,
which the system can call to say "give back what pages you can, since
I am low on memory".  The interface is written such that the system
can ask for memory back, and each subsystem can then return multiple
pages to the system on each such call.  The VFAT32.VXD IFS module
specifically gives memory back only a single page at a time, unfairly
competing with other subsystems and user programs, even though the
documentation for the system specifically recommends returning as
much memory as possible on each request.

If you are an FS developer writing code in the IFSMgr under Windows,
and you don't know this, you will never be able to write FS code that
is capable of competing, side by side in the same system, against a
VFAT32 FS under benchmark conditions (e.g. as a file server vs.
"NetBench").

I would have to examine the Linux FAT FS code in more detail to know
what they currently do with regard to caching.  When FS's were
smaller, and Udo Walter was working on the code (say circa 1996),
they did some aggressive caching, which they have probably tuned down
as Windows FS's and disk space have both increased.

> > I don't think the issue is actually clustering.  If the cluster
> > size is set, you really don't have a choice, since your block
> > chain is hung off that, so it's not like FreeBSD goes out of its
> > way to pessimize access.
>
> I've been doing some experiments with iostat, and I've figured out that on
> FAT filesystem the kernel is generating very small IO requests (cluster
> size), but with ufs the request size is much bigger (128KB).
> I have thought that this was the problem.

This has more to do with sequential access.  Technically, you can
read a FAT cluster at a time instead of an FS block at a time, and
you will achieve some multiplier on sequential access, but you will
find that under load, the fault rate for blocks will go up.  Also,
even if you read 64K at a time, you will end up LRU'ing out the data
that you don't access.

The issue is that UNIX files are accessed by offset, while FAT files
are accessed by offset by chaining clusters from the start to the
cluster of interest, and then reading blocks.
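To show what the cache-behind suggestion from earlier buys for this
access pattern, here is a companion user-space sketch, with the same
disclaimers: the names are invented, this is not the msdosfs code,
and it sidesteps the invalidation problem discussed below by keeping
the FAT copy in plain memory that nothing reclaims.  The idea is to
remember the last (file cluster index, cluster number) pair resolved,
so a strictly sequential reader follows one chain link per call
instead of chaining from the start every time:

#include <stdint.h>
#include <stdio.h>

#define NCLUSTERS 1024
#define FAT_EOC   0x0fffffffu       /* hypothetical end-of-chain marker */

static uint32_t fat[NCLUSTERS];     /* in-core FAT copy, as above */

struct fat_hint {
    uint32_t first_cluster;         /* head of the file's cluster chain */
    uint32_t hint_index;            /* file-relative index last resolved */
    uint32_t hint_cluster;          /* cluster number it resolved to */
};

/* Map a file-relative cluster index to a cluster number.  Cold, this
 * is O(index) from the head of the chain; warm, a sequential reader
 * pays one link per call. */
static uint32_t
fat_bmap(struct fat_hint *fh, uint32_t index)
{
    uint32_t cl = fh->first_cluster;
    uint32_t i = 0;

    if (fh->hint_cluster != 0 && fh->hint_index <= index) {
        cl = fh->hint_cluster;      /* resume from the cached position */
        i = fh->hint_index;
    }
    while (i < index && cl != FAT_EOC) {
        cl = fat[cl];               /* follow the chain one link */
        i++;
    }
    fh->hint_index = index;         /* the cache-behind step */
    fh->hint_cluster = cl;
    return (cl);
}

int
main(void)
{
    struct fat_hint fh = { .first_cluster = 2 };

    for (uint32_t c = 2; c < 9; c++)    /* toy chain 2 -> 3 -> ... -> 9 */
        fat[c] = c + 1;
    fat[9] = FAT_EOC;

    for (uint32_t n = 0; n < 8; n++)    /* sequential read, one link each */
        printf("index %u -> cluster %u\n", n, fat_bmap(&fh, n));
    return (0);
}

The catch in the kernel is exactly the problem described next: the
hint is only safe for as long as whatever backs fat[] cannot be
invalidated out from under the structure holding the hint.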
Since you can't cache non-metadata references by offset at the inode
level, and can only do it at the inode level as a reference into the
device cache hung off the device vnode (which can then be faulted),
there's an impedance mismatch.  If you force the caching in separate
memory for as much of the FAT as you can, then that buys you the
cluster offsets, and you can cache the blocks containing the chaining
information, as well.

Rather than trying to design this for you on the fly, the best thing
you can do is profile the code, so that you can compare how things
are with how things end up.  This is really needed if you plan on
using this as a school project and want to write up papers.

Also, the actual empirical choice of cache depth and other tradeoffs
is going to be based on intended access patterns.  So if you want to
get a good paper out of it, you'll probably implement as many types
of instrumentation, tunables, and attempts to speed up what's slow in
profile as you can.  The more you do, the better your results will
be, and telling you any more about what has worked in the past would
bias your efforts unfairly; I'd rather just give you a direction to
look in, and have you reach whatever conclusions you reach.

-- Terry