Date: Wed, 12 Jun 2013 07:25:24 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Konstantin Belousov <kostikbel@gmail.com>
Cc: fs@FreeBSD.org
Subject: Re: missed clustering for small block sizes in cluster_wbuild()
Message-ID: <20130612053543.X900@besplex.bde.org>
In-Reply-To: <20130611063446.GJ3047@kib.kiev.ua>
References: <20130607044845.O24441@besplex.bde.org> <20130611063446.GJ3047@kib.kiev.ua>
On Tue, 11 Jun 2013, Konstantin Belousov wrote:

> On Fri, Jun 07, 2013 at 05:28:11AM +1000, Bruce Evans wrote:
>> I think this is best fixed by removing the check above and checking
>> here.  Then back out the changes.  I don't know this code well
>> enough to write the backing out easily.
>
> Could you test this, please ?

It works in limited testing.  I got a panic from not adapting for the
changed locking when merging it to my version, and debugging this
showed a problem.

> diff --git a/sys/kern/vfs_cluster.c b/sys/kern/vfs_cluster.c
> index b280317..9e1528e 100644
> --- a/sys/kern/vfs_cluster.c
> +++ b/sys/kern/vfs_cluster.c
> ...
> @@ -904,14 +904,10 @@ cluster_wbuild(struct vnode *vp, long size, daddr_t start_lbn, int len,
>
>  			/*
>  			 * Check that the combined cluster
> -			 * would make sense with regard to pages
> -			 * and would not be too large
> +			 * would make sense with regard to pages.
>  			 */

The comment needs more changes.  There is no check "with regard to
pages" now.  The old comment was poorly worded: the code never made an
extra check that the cluster "would make sense with regard to pages"
here (that check was always later).  What it did was use the page
count in the largeness check.

> -			if ((tbp->b_bcount != size) ||
> -			    ((bp->b_blkno + (dbsize * i)) !=
> -			    tbp->b_blkno) ||
> -			    ((tbp->b_npages + bp->b_npages) >
> -			    (vp->v_mount->mnt_iosize_max / PAGE_SIZE))) {
> +			if (tbp->b_bcount != size ||
> +			    bp->b_blkno + dbsize * i != tbp->b_blkno) {
>  				BUF_UNLOCK(tbp);
>  				break;
>  			}

Contrary to what I said before, the caller doesn't always limit the
cluster size.  Only the cluster_write() caller does that.  The
vfs_bio_awrite() caller doesn't.  Now it is fairly common to allocate
1 page too many and have to back out.  This happens even when
everything is aligned.  I observed the following:
- there were a lot of contiguous dirty buffers, and this loop happily
  built up a cluster with 17 pages, though mnt_iosize_max was only 16
  pages.
  Perhaps the extra page is necessary if the part of the buffer to be
  written starts at a nonzero offset, but there was no offset in the
  case that I observed.  (Can there be one, and if so, is it limited
  to an offset within the first page?  The general case needs 16
  extra 4K pages to write a 64K-block, when the offset of the area to
  be written is 64K-512.)
- ...

> @@ -964,6 +960,22 @@ cluster_wbuild(struct vnode *vp, long size, daddr_t start_lbn, int len,
>  				bp->b_pages[bp->b_npages] = m;
>  				bp->b_npages++;
>  			}
> +			if (bp->b_npages > vp->v_mount->
> +			    mnt_iosize_max / PAGE_SIZE) {

- ... and then this detects that the 17th page is 1 too many and
  cleans up.

> +				KASSERT(i != 0, ("XXX"));
> +				j++;
> +				for (jj = 0; jj < j; jj++) {
> +					vm_page_io_finish(tbp->
> +					    b_pages[jj]);
> +				}
> +				vm_object_pip_subtract(
> +				    m->object, j);
> +				bqrelse(tbp);
> +				bp->b_npages -= j;
> +				VM_OBJECT_WUNLOCK(tbp->
> +				    b_bufobj->bo_object);
> +				goto finishcluster;
> +			}
>  		}
>  		VM_OBJECT_WUNLOCK(tbp->b_bufobj->bo_object);
>  	}

I think it would work, and would fix other bugs, to check
(tbp->b_bcount + bp->b_bcount <= vp->v_mount->mnt_iosize_max) up
front.  Drivers should be able to handle an i/o size of b_bcount,
however many pages that takes.  There must be a limit on b_pages, but
it seems to be non-critical, and the limit on b_bcount gives one of
(mnt_iosize_max / PAGE_SIZE) rounded in some way, possibly increased
by 1 or doubled to account for offsets.  If mnt_iosize_max is not a
multiple of PAGE_SIZE, then the limit using pages doesn't even allow
covering mnt_iosize_max, since the rounding down is non-null.

I found this bug while debugging a recent PR about bad performance and
hangs under write pressure.  I only have 1 other clearly correct fix
for the bad performance: msdosfs is missing read clustering for
read-before-write.  I didn't notice that this was necessary when I
implemented clustering for msdosfs a few years ago.
I thought that the following patch was a complete fix, but have found
more performance problems in clustering:

@ diff -u2 msdosfs_vnops.c~ msdosfs_vnops.c
@ --- msdosfs_vnops.c~	Thu Feb  5 19:11:37 2004
@ +++ msdosfs_vnops.c	Wed Jun 12 04:01:19 2013
@ @@ -740,5 +756,19 @@
@  		 * The block we need to write into exists, so read it in.
@  		 */
@ -		error = bread(thisvp, bn, pmp->pm_bpcluster, cred, &bp);
@ +		if ((ioflag >> IO_SEQSHIFT) != 0 &&

This was cloned from the ffs version.  All file systems should call a
common function instead of duplicating the cluster_read()/bread()
decision.  Similarly for write clustering, except there are more
decisions, but ffs and ext2fs do this in UFS_BALLOC() and
ext2fs_balloc() (?), where not all the info that might be needed is
available.

I repeated the (ioflag >> IO_SEQSHIFT) calculation instead of copying
it to a variable like ffs does, to localise this patch and to avoid
copying ffs's mounds of style bugs in the declaration of the variable.

@ +		    (vp->v_mount->mnt_flag & MNT_NOCLUSTERR) == 0) {
@ +			error = cluster_read(vp, dep->de_FileSize, bn,
@ +			    pmp->pm_bpcluster, NOCRED,
@ +#if 0
@ +			    (uio->uio_offset & pmp->pm_crbomask) +
@ +			    uio->uio_resid,

This part was copied from msdosfs_read().  msdosfs_read() uses the uio
of some reader.  Here the uio for read-before-write is the writer's.
It isn't clear that either should be used here.  UFS_BALLOC() is not
passed the full uio info needed to make this calculation, and it uses
the fixed size MAXBSIZE.  That is wrong in a different way.

This parameter is used to reduce latency.  It asks for a small cluster
of the specified size, followed by read-ahead of full clusters, with
the amount of read-ahead controlled by vfs.read_max.  This gives
clusters of non-uniform sizes and offsets, especially when the reader
uses small blocks.  scottl recently added vfs.read_min, which can be
tuned to prevent this.  I don't like many things in this area.
vfs.read_min works OK, but is another hack.
The default should probably be to optimize for throughput instead of
latency (the reverse of the current one, but the current one is
historical so it shouldn't be changed).

The units of vfs.read_max and vfs.read_min are fs block sizes.  This
is quite broken when the cluster size is varied and sometimes small.
E.g., with the old default read_max of 8 and a block size of 512, the
read-ahead is limited to a whole 4K.  The current default is 64, which
is still too small with small block sizes.  But if you increase
vfs.read_max a lot, then the read-ahead becomes almost infinite when
the block size is large.  In my version, the units for vfs.read_max
are 512-blocks (default 256, for the old limit of 128K read-ahead
with ffs's old default 16K-blocks).  The current limit of 64 seems
excessive with ffs's current default of 32K-blocks (2048K
read-ahead).  My units are mostly better, but I just noticed that
they have a different too-delicate interaction with application and
kernel block sizes...

@ +#else
@ +			    MAXPHYS,

The above gave sub-maximal clustering.  So does ffs's MAXBSIZE when
it is smaller than mnt_iosize_max.  In ~5.2, mnt_iosize_max is
physical and is usually DFLTPHYS == MAXBSIZE, so ffs's choice usually
gives maximal clusters.  However, in -current, mnt_iosize_max is
virtual and is usually MAXPHYS == 2 * MAXBSIZE.  So MAXPHYS is
probably correct here.

...  Then I noticed another problem.  MAXPHYS is twice mnt_iosize_max
here, so the cluster size is only mnt_iosize_max = DFLTPHYS = 64K.
This apparently interacts badly with vfs.read_max = 256 512-blocks.
I think it breaks read-ahead.  Throughput drops by a factor of 4 for
read-before-write relative to direct writes (not counting the factor
of 2 for the doubled i/o from the reads), although all the i/o sizes
are 64K.  Increasing vfs.read_max by just 16 fixes this.  The
throughput drop is then only 10-20% (there must be some drop for the
extra seeks).  I'm not sure if extra read-ahead is good or bad here.
More read-ahead in read-before-write reduces seeks, but it may also
break drives' caching and sequential heuristics.  My drives are old
and have small caches and are very sensitive to the i/o pattern for
read-before-write.

@ +#endif
@ +			    ioflag >> IO_SEQSHIFT, &bp);
@ +		} else {
@ +			error = bread(vp, bn, pmp->pm_bpcluster,
@ +			    NOCRED, &bp);
@ +		}
@  		if (error) {
@  			brelse(bp);

To complete getting mostly-full clusters for writing large files to
msdosfs, I hacked the block size heuristic some more to give larger
blocks:

@ diff -u2 msdosfs_vfsops.c~ msdosfs_vfsops.c
@ --- msdosfs_vfsops.c~	Sun Jun 20 14:20:03 2004
@ +++ msdosfs_vfsops.c	Wed Jun 12 04:32:52 2013
@ @@ -519,7 +547,11 @@
@
@  	if (FAT12(pmp))
@ -		pmp->pm_fatblocksize = 3 * pmp->pm_BytesPerSec;
@ +		pmp->pm_fatblocksize = 3 * DEV_BSIZE;
@ +	else if (FAT16(pmp))
@ +		pmp->pm_fatblocksize = PAGE_SIZE;
@  	else
@ -		pmp->pm_fatblocksize = MSDOSFS_DFLTBSIZE;
@ +		pmp->pm_fatblocksize = DFLTPHYS;
@ +	pmp->pm_fatblocksize = roundup(pmp->pm_fatblocksize,
@ +	    pmp->pm_BytesPerSec);
@
@  	pmp->pm_fatblocksec = pmp->pm_fatblocksize / DEV_BSIZE;

I changed my version long ago to use 3*DEV_BSIZE for FAT12 and
PAGE_SIZE in all other cases.  3*pmp->pm_BytesPerSec is bogus, since
IIRC a small file system doesn't need even 3 DEV_BSIZE sectors, and
when the sector size is larger than DEV_BSIZE it won't need 3
sectors.

MSDOSFS_DFLTBSIZE is 4096.  This and PAGE_SIZE are really too small
for huge FATs.  The FAT i/o size should really depend on the size of
the FAT, not on its type, but small sizes are more robust and more
efficient for sparse writes.  The larger size also requires fewer
buffers.  4K is not too bad, but 512 would be really bad.

I just remembered why I like small blocks.  They are more robust, and
clustering makes them efficient.  But clustering of the FAT isn't
done.  FAT blocks are normally written with bdwrite(), but without
B_CLUSTEROK.  I think some clustering still occurs, since
!B_CLUSTEROK is not honored.
Clusters are read using bread().  I think this is followed by
breadn(), giving the old limited read-ahead, which isn't nearly
enough with 4K-blocks.

DFLTPHYS or MAXPHYS wasn't usable as a default until geom made the
max i/o size virtual, since it didn't work for devices with a lower
limit.  Neither did MAXBSIZE.  Even 4K might be larger than the
device limit.

Bruce