Date: Sat, 28 Mar 2015 13:44:57 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Alexander Motin <mav@freebsd.org>
Cc: freebsd-fs@freebsd.org, "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>
Subject: Re: MAXBSIZE increase
Message-ID: <20150328111733.L963@besplex.bde.org>
In-Reply-To: <5515C421.4040703@FreeBSD.org>
References: <5515C421.4040703@FreeBSD.org>
> Experimenting with NFS and ZFS I found an inter-operation issue: ZFS by
> default uses blocks of 128KB, while FreeBSD NFS (both client and server)
> is limited to 64KB requests by the value of MAXBSIZE.  On file rewrite
> that limitation makes ZFS do slow read-modify-write cycles for every
> write operation, instead of just writing the new data.  A trivial iozone
> test shows a major difference between initial write and rewrite speeds
> because of this issue.
>
> Looking through the sources, I've found, and in r280347 fixed, a number
> of improper MAXBSIZE use cases in device drivers.  After that I see no
> reason why MAXBSIZE cannot be increased to at least 128KB to match the
> ZFS default (ZFS now supports blocks of up to 1MB, but that is not the
> default and is so far rare).  I've made a test build and also
> successfully created a UFS file system with a 128KB block size -- not
> sure it is needed, but it seems to survive this change well too.
>
> Is there anything I am missing, or is it safe to raise this limit now?

I see the following minor problems:

- static and dynamic allocation of MAXBSIZE bytes would be more wasteful
than before.

- boot blocks used to do static allocation of MAXBSIZE bytes.  Now they
just do ffs's sanity check that the block size is less than that.  A
block size larger than this is not necessarily invalid, but just
unsupported by the running instance of the buffer cache layer (so
unsupported by the running instance of ffs too).  Another or the same OS
may have a larger MAXBSIZE, and the user may have broken portability by
actually using this to create a file system that cannot be read by OS's
with the historical MAXBSIZE.

This check is bogus for boot blocks, since they don't use the buffer
cache layer.  ufsread.c uses a sort of anti-buffer-cache to avoid
problems, but it gives extreme slowness.  It uses a virtual block size
of 4K and does i/o 4K at a time with no caching.  The buffer must not
cross a 64K boundary on x86, and the MI code states this requirement
for all arches.  In i386/boot2, dmadat is 64K-aligned, so the virtual
buffer size could be up to 64K, except that dmadat is also used for 3
other buffers and only 4K is used for the data buffer.  The data
structure for this is:

X /* Buffers that must not span a 64k boundary. */
X struct dmadat {
X 	char blkbuf[VBLKSIZE];	/* filesystem blocks */
X 	char indbuf[VBLKSIZE];	/* indir blocks */
X 	char sbbuf[SBLOCKSIZE];	/* superblock */
X 	char secbuf[DEV_BSIZE];	/* for MBR/disklabel */
X };
X static struct dmadat *dmadat;

I don't like the FreeBSD boot code, and use my version of biosboot if
possible.  I expanded its buffers and improved its caching a year or 2
ago.  Old versions have 2 buffers of size MAXBSIZE in commented-out
code, since this doesn't work, especially when written in C.  The
commented-out code also sets a size of 4K for one of these buffers.
This last worked, for the default ffs block size only, in about 1990
(this code is from Mach).  The old code actually uses 3 buffers of size
8K, corresponding to 3 of the 4 buffers in dmadat.  This broke about 15
years ago when the default and normal ffs block size was increased to
16K.  I fixed this by allocating all of the buffers in asm.  From
start.S:

X ENTRY(disklabel)
X 	. = EXT(boot1) + 0x200 + 0*276 + 1*0x200
X
X 	.globl EXT(buf)
X 	.set EXT(buf), EXT(boot1) + 0x20000
X 	.globl EXT(iobuf)
X 	.set EXT(iobuf), EXT(boot1) + 0x30000
X 	.globl EXT(mapbuf)
X 	.set EXT(mapbuf), EXT(boot1) + 0x40000

boot1 is loaded at a 64K boundary and overlaid with boot2, the same as
in the -current boot2.  The above bites off 64K pieces of the heap for
all large data structures.  boot2 does this only for dmadat (64K?),
using hackish C code instead.
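For illustration, a minimal sketch (not taken from the boot code) of the
constraint that the dmadat layout and the 64K-aligned carving above are
meant to satisfy: a transfer buffer must not cross a 64K boundary, and a
buffer starting at a 64K-aligned address trivially satisfies this for
any size up to 64K.

#include <stdint.h>

/*
 * Sketch only: return nonzero if the buffer [addr, addr + size) crosses
 * a 64K boundary.  A buffer starting at a 64K-aligned address never
 * crosses one as long as size <= 64K, which is why carving the heap in
 * 64K-aligned, 64K-sized pieces is enough.
 */
static int
crosses_64k(uint32_t addr, uint32_t size)
{

	return (size != 0 && (addr >> 16) != ((addr + size - 1) >> 16));
}

The four buffers in dmadat together fit inside a single 64K region when
dmadat is 64K-aligned, so each of them passes this check.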
Then I improved biosboot's caching.  biosboot was using my old code
which did caching mainly for floppies, since old systems were too slow
to keep up with reading even floppies one 512-block at a time.  It used
a read-ahead buffer of size 18*512 = 9K to optimize for floppies up to
size 1440K.  This worked OK for old hard disks with the old default ffs
block size of 8K too.  But it gives much the same anti-caching as
-current's virtual 4K buffers when the ffs block size is 16K or larger.
I didn't expand the cache to a large one on the heap, but just changed
it to 32*512 = 16K to work well with my default ffs block size of 16K
(32K is pessimal for my old disk), and fixed some alignment problems
(the old code attempts to align on track boundaries, but tracks don't
exist for modern hard disks, and the alignment needs to be to ffs block
boundaries, else 16K-blocks would be split every time in the 16K
"cache").

Summary for the boot blocks: they seem to be unaffected by increasing
MAXBSIZE, but their anti-cache works even better for fragmenting larger
blocks.

- the buffer cache is still optimized for i386 with BKVASIZE = 8K.
64-bit systems don't need the complications and pessimizations to fit
in i386's limited kva, but have them anyway.  When the default ffs
block size was doubled to 16K, BKVASIZE was doubled to match, but the
tuning wasn't doubled to match.  This reduced the effective number of
buffers by a factor of 2.  This pessimization was mostly unnoticeable,
since memory sizes grew by more than a factor of 2 and nbuf grew by
about a factor of 2.  But increasing (nbuf * BKVASIZE) much more isn't
possible on i386, since it reaches a kva limit.  Then when ffs's
default block size was doubled to 32K, BKVASIZE wasn't even doubled to
match.  If anyone actually uses the doubled MAXBSIZE, then BKVASIZE
will be mistuned by another factor of 2.  They probably shouldn't do
that.

A block size of 64K already works poorly in ffs.  Relative to a block
size of 32K, it mainly doubles the size for metadata i/o without making
much difference for data i/o, since data i/o is clustered.  An fs block
size equal to MAXPHYS also makes clustering useless, by limiting the
maximum number of blocks per cluster to 1.  That is better than the
ratio of 4/32 in ufsread and 9/{8,16,32} in old biosboot, but still
silly.  Large fs block sizes (where "large" means more than about 2K)
are only good when clustering doesn't work and the disk doesn't like
small blocks.  This may be the case for ffs on newer hard disks.
Metadata is not clustered for ffs.  My old hard disks don't like any
size larger than 16K, but my not-so-old hard disks prefer 32K or above.

nfs for zfs will actually use the new MAXBSIZE.  I don't like it using
a hard-coded size.  It gives buffer cache fragmentation.  The new
MAXBSIZE will non-accidentally match the fs block size for zfs, but
even the old MAXBSIZE doesn't match the usual fs block size for any
file system.

- cd9660 mount uses MAXBSIZE for a sanity check.  It can only support
block sizes up to that, but there must be an fs limit too.  It should
probably use min(CD9660_MAXBSIZE, MAXBSIZE).

- similarly in msdosfs, except I'm sure that there is an fs limit of
64K.  Microsoft specifies that the limit is 32K, but 64K works in
FreeBSD and perhaps even in Microsoft OS's.

- similarly in ffs, except the ffs limit is historically identical to
MAXBSIZE.  I think it goes the other way -- MAXBSIZE = 64K is the
historical ffs limit, and the buffer cache has to support that.
Perhaps ffs should remain at its historical limit.

The lower limit is still local in ffs.  It is named MINBSIZE.  Its
value is 4K in -current but 512 in my version.  ffs has no fundamental
limit at either 4K or 64K, and can support any size supported by the
hardware, after fixing some bugs involving assumptions that the
superblock fits in an ffs block.
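To make the shape of the suggested per-fs checks concrete, here is a
minimal sketch (not the actual cd9660 mount code; CD9660_MAXBSIZE is
the hypothetical fs-specific constant suggested above, with an assumed
value) of clamping the accepted block size to both the fs limit and the
buffer cache limit:

#include <sys/param.h>	/* MAXBSIZE, MIN(), DEV_BSIZE */
#include <errno.h>

#ifndef CD9660_MAXBSIZE
#define	CD9660_MAXBSIZE	65536	/* assumed fs limit; not an existing macro */
#endif

/*
 * Sketch of a mount-time sanity check: reject logical block sizes
 * larger than either the fs's own limit or the buffer cache's limit,
 * instead of checking against MAXBSIZE alone.
 */
static int
check_logical_block_size(unsigned long bsize)
{

	if (bsize < DEV_BSIZE || bsize > MIN(CD9660_MAXBSIZE, MAXBSIZE))
		return (EINVAL);
	if ((bsize & (bsize - 1)) != 0)		/* must be a power of 2 */
		return (EINVAL);
	return (0);
}

The same shape would apply to msdosfs with its 64K (or Microsoft's
documented 32K) limit, and to ffs if its limit were split out from
MAXBSIZE.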
- many file systems use MAXBSIZE to limit the read-ahead for
cluster_read().  This seems wrong.  cluster_read() has a natural limit
of geom's virtual "best" i/o size (normally MAXPHYS).  The decision
about the amount of read-ahead should be left to the clustering code if
possible.  But it is unclear what this should be.

The clustering code gets this wrong anyway.  It has sysctls
vfs.read_max (default 64) and vfs.write_behind (default 1).  The units
for these are broken: they are fs-blocks.  A read-ahead of 64 fs-blocks
of size 512 is too different from a read-ahead of 64 fs-blocks of size
MAXBSIZE, whatever the latter is.  My version uses a read-ahead scaled
in 512-blocks (default 256 blocks = MAXPHYS bytes).  The default
read-ahead shouldn't vary much with MAXPHYS, MAXBSIZE or the fs block
size, but should vary with the device (don't read ahead 64 large fs
blocks on a floppy disk device, as asked for by -current's read_max
...).
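To make the units problem concrete, a toy userland sketch (not kernel
code; the defaults are the ones quoted above) comparing the byte count
implied by an fs-block-scaled read_max with one scaled in 512-blocks:

#include <stdio.h>

/*
 * Toy illustration: a read-ahead limit scaled in fs blocks implies
 * wildly different byte counts as the fs block size changes, while one
 * scaled in 512-blocks stays fixed.
 */
int
main(void)
{
	const unsigned read_max = 64;		/* -current default, in fs blocks */
	const unsigned read_max_512 = 256;	/* alternative, in 512-blocks */
	unsigned bsize;

	for (bsize = 512; bsize <= 65536; bsize *= 2)
		printf("fs block %6u: fs-block scaling %7u bytes, "
		    "512-block scaling %6u bytes\n",
		    bsize, read_max * bsize, read_max_512 * 512);
	return (0);
}

With 64K fs blocks the fs-block scaling asks for 4M of read-ahead,
while the 512-block scaling stays at 128K (= MAXPHYS, as quoted above)
for every block size.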
- ffs utilities like fsck are broken by limiting themselves to the
buffer cache limit of MAXBSIZE, like the boot blocks, but with less
reason, since they don't have space constraints and not being limited
by the current OS is more useful.  Unless MAXBSIZE = 64K is considered
to be a private ffs limit that escaped.  Then the ffs code should spell
it FFS_MAXBSIZE or 64K.

Bruce