Date: Wed, 14 Apr 2010 16:38:28 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Andriy Gapon <avg@FreeBSD.org>
Cc: arch@FreeBSD.org, Rick Macklem <rmacklem@uoguelph.ca>
Subject: Re: (in)appropriate uses for MAXBSIZE
Message-ID: <20100414144336.L12587@delplex.bde.org>
In-Reply-To: <4BC34402.1050509@freebsd.org>
References: <4BBEE2DD.3090409@freebsd.org>
 <Pine.GSO.4.63.1004090941200.14439@muncher.cs.uoguelph.ca>
 <4BBF3C5A.7040009@freebsd.org> <20100411114405.L10562@delplex.bde.org>
 <4BC34402.1050509@freebsd.org>
On Mon, 12 Apr 2010, Andriy Gapon wrote:

> on 11/04/2010 05:56 Bruce Evans said the following:
>> On Fri, 9 Apr 2010, Andriy Gapon wrote:
> [snip]
>>> I have lightly tested this under qemu.
>>> I used my avgfs:) modified to issue 4*MAXBSIZE bread-s.
>>> I removed the size > MAXBSIZE check in getblk (see a parallel thread
>>> "panic: getblk: size(%d) > MAXBSIZE(%d)").
>>
>> Did you change the other known things that depend on this?  There is
>> the b_pages limit of MAXPHYS bytes which should be checked for in
>> another way
>
> I changed the check the way I described in the parallel thread.

I didn't notice anything there about checking MAXPHYS instead of
MAXBSIZE.  Was an explicit check needed?  (An implicit check would
probably have worked: most clients were limited by the MAXBSIZE check,
and the pbuf client always uses MAXPHYS or DFLTPHYS.)

>> and the soft limits for hibufspace and lobufspace which only matter
>> under load conditions.
>
> And what should these be?
> hibufspace and lobufspace seem to be auto-calculated.  One thing that
> I noticed, and that was a direct cause of the problem described below,
> is that the difference between hibufspace and lobufspace should be at
> least the maximum block size allowed in getblk() (perhaps it should be
> strictly equal to that value?).
> So in my case I had to make that difference MAXPHYS.

Hard to say.  They are mostly only heuristics which mostly only matter
under heavy loads.  You can change the defaults using sysctl, but it is
even harder to know what changes might be good without knowing the
details of the implementation.

>>> And I bumped MAXPHYS to 1MB.
>>> ...
>>> But I observed the reading process (dd bs=1m on avgfs) spending a
>>> lot of time sleeping on needsbuffer in getnewbuf.  needsbuffer value
>>> was VFS_BIO_NEED_ANY.  Apparently there was some shortage of free
>>> buffers.  Perhaps some limits/counts were incorrectly auto-tuned.
>>
>> This is not surprising, since even 64K is 4 times too large to work
>> well.  Buffer sizes of larger than BKVASIZE (16K) always cause
>> fragmentation of buffer kva. ...
>
> So, BKVASIZE is the best read size from the point of view of buffer
> space usage?

It is the best buffer size, which is almost independent of the best
read size.  First, userland reads will be re-blocked into
file-system-block-size reads...

> E.g. a single MAXBSIZE=64K read results in a single 64K GEOM read
> request, but leads to buffer space map fragmentation, because of
> size > BKVASIZE.
> On the other hand, four sequential reads of BKVASIZE=16K bytes are
> perfect from the buffer space point of view (no fragmentation
> potential), but they result in 4 GEOM

Clustering occurs above geom, so geom only sees small requests for small
files, random accesses, and buggy cases for sequential accesses to large
files where the bugs give partial randomness.

E.g., a single 64K read from userland normally gives 4 16K ffs blocks in
the buffer cache.  Clustering turns these into 1 128K block in a pbuf
(64K for the amount read now and 64K for read-ahead; there may be more
read-ahead but it would go in another pbuf).  geom then sees the 128K
(MAXPHYS) block.  Most device drivers still only support i/o's of size
<= DFLTPHYS, but geom confuses the clustering code into producing
clusters larger than its intended maximum of what the device supports,
by advertising support for MAXPHYS (v_mount->mnt_iosize_max).  So geom
normally turns the 128K request into 2 64K requests.  Clustering
finishes by converting the 128K request into 8 16K requests (4 for use
now and 4 later for read-ahead).
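For concreteness, here is a small userland sketch of that arithmetic
(the constants mirror the stock MAXBSIZE/BKVASIZE/DFLTPHYS/MAXPHYS
values of this era; it is only an illustration, not kernel code):

/*
 * Illustration only: reproduce the reblocking arithmetic described
 * above for one sequential 64K userland read through 16K ffs blocks.
 */
#include <stdio.h>

#define FS_BSIZE	(16 * 1024)	/* ffs block size (== BKVASIZE) */
#define READ_SIZE	(64 * 1024)	/* userland read (== MAXBSIZE) */
#define DRIVER_MAX	(64 * 1024)	/* what most drivers take (DFLTPHYS) */
#define CLUSTER_MAX	(128 * 1024)	/* pbuf/cluster limit (MAXPHYS) */

int
main(void)
{
	printf("userland read:   1 x %dK\n", READ_SIZE / 1024);
	printf("ffs blocks:      %d x %dK\n",
	    READ_SIZE / FS_BSIZE, FS_BSIZE / 1024);
	printf("cluster (read + read-ahead): 1 x %dK pbuf\n",
	    CLUSTER_MAX / 1024);
	printf("geom -> driver:  %d x %dK requests\n",
	    CLUSTER_MAX / DRIVER_MAX, DRIVER_MAX / 1024);
	printf("buffer cache on completion:  %d x %dK buffers\n",
	    CLUSTER_MAX / FS_BSIZE, FS_BSIZE / 1024);
	return (0);
}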
OTOH, the first block of 4 sequential reads of 16K produces the same
128K block at the geom level, modulo bugs in the read-ahead.  This now
consists of 1 and 7 blocks of normal read and read-ahead, respectively,
instead of 4 and 4.  Then the next 3 blocks are found in the buffer
cache as read-ahead instead of being read from the disk (actually, this
is insignificantly different from the first case after ffs splits up
the 64K into 4 times 16K).

So the block size makes almost no difference at the syscall level
(512-blocks take significantly more CPU but improve latency, while huge
blocks take significantly less CPU but significantly unimprove
latency).  The file system block size makes only secondary differences:
- clustering only works to turn small logical i/o's into large physical
  ones when sequential blocks are allocated sequentially, but always
  allocating blocks sequentially is hard to do, and using large file
  system blocks reduces the loss when the allocation is not sequential
- large file system blocks also reduce the amount of work that
  clustering has to do to reblock.  This benefit is much smaller than
  the previous one.
- the buffer cache is only designed to handle medium-sized blocks well.
  With 512-blocks, it can only hold 1/32 as much as with 16K-blocks, so
  it will thrash 32 times as much with the former.  Now that the
  thrashing is to VMIO instead of to the disk, this only wastes CPU.

With any block size larger than BKVASIZE, the buffer cache may become
fragmented, depending on the combination of block sizes.  Mixed
combinations are the worst, and the system doesn't do anything to try
to avoid them.  The worst case is a buffer cache full of 512-blocks,
with getblk() wanting to allocate a 64K-block.  Then it needs to wait
for 32 contiguous blocks to become free, or forcibly free some, or move
some...

> I/O requests.
> The thing is that a single read requires a single contiguous virtual
> address space chunk.  Would it be possible to take the best of both
> worlds by somehow allowing a single large I/O request to work with
> several buffers (with b_kvasize == BKVASIZE) in an iovec-like style?
> Have I just reinvented the bicycle? :)
> Probably not, because an answer to my question is probably 'not
> (without lots of work in lots of places)' as well.

Separate buffers already partly provide this, and combined with command
queuing in the hardware they provide it completely, in perhaps a better
way than can be done in software.  vfs clustering attempts much less
but is still complicated.  It mainly wants to convert buffers that have
contiguous disk addresses into a super-buffer that has contiguous
virtual memory, and to combine this with read-ahead, to reduce the
number of i/o's.  All drives less than 10 years old benefit only
marginally from this, since the same cases that vfs clustering can
handle are also easy for drive clustering, caching and
read-ahead/write-behind (especially the latter) to handle even better,
so I occasionally try turning off vfs clustering to see if it makes a
difference; unfortunately it still seems to help on all drives,
including even reducing total CPU usage despite its own large CPU
usage.

> I see that breadn() certainly doesn't work that way.  As I
> understand, it works like bread() for one block plus starts something
> like 'asynchronous breads()' for a given count of other blocks.

Usually breadn() isn't called, but clustering reads to the end of the
current cluster or maybe the next cluster.  breadn() was designed when
reading ahead a single cluster was helpful.  Now, drives read ahead a
whole track or similar, probably hundreds of sectors, so reading ahead
a single sector is almost useless.  It doesn't even reduce the number
of i/o's unless it is clustered with the previous i/o.
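For reference, a minimal sketch of what such a breadn() call looks
like, assuming the bread()/breadn() prototypes in sys/buf.h of this
era; the function name and the 16K block size are invented for the
example:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/bio.h>
#include <sys/buf.h>
#include <sys/ucred.h>
#include <sys/vnode.h>

/*
 * Illustration only: read logical block lbn synchronously and ask
 * breadn() to start asynchronous reads of the next two blocks -- the
 * "bread() for one block plus asynchronous reads for a given count of
 * other blocks" behaviour described in the quote above.
 */
static int
example_read_with_readahead(struct vnode *vp, daddr_t lbn,
    struct buf **bpp)
{
	daddr_t rablkno[2];
	int rabsize[2];

	/* Logical blocks to read ahead, and their sizes. */
	rablkno[0] = lbn + 1;
	rablkno[1] = lbn + 2;
	rabsize[0] = rabsize[1] = 16384;

	return (breadn(vp, lbn, 16384, rablkno, rabsize, 2, NOCRED, bpp));
}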
> I am not sure about details of how cluster_read() works, though.
> Could you please explain the essence of it?

See above.  It is essentially the old hack of reading ahead a whole
track in software, done in a sophisticated way but with fewer attempts
to satisfy disk geometry timing requirements.  Long ago, everything was
so slow that sequential reads done from userland could not keep up with
even a floppy disk, but sequential i/o's done from near the driver
could, even with i/o's of only 1 sector.  I've only ever seen this
working well for floppy disks.  For hard disks, the i/o's need to be
multi-sector, and needed to be related to the disk geometry (handle
full tracks and don't keep results from intermediate sectors that are
not needed yet iff doing so wouldn't thrash the cache).  Now, it is
unreasonable to try to know the disk geometry, and vfs clustering
doesn't try.  Fortunately, this is not much needed, since newer drives
have their own track caches which, although they don't form a complete
replacement for vfs clustering (see above), reduce the losses to extra
non-physical reads.  Similarly for another problem with vfs: all
buffers and their clustering are file (vnode) based, which almost
forces missing intermediate sectors when reading a file, but a working
device-level (track or similar) cache in the drive mostly compensates
for not having one in the OS.
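As a rough illustration of where cluster_read() sits, here is a sketch
loosely modelled on the ffs read path of this era; the function name
and arguments are invented for the example, and the cluster_read()
prototype assumed is the one current at the time of this thread:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/bio.h>
#include <sys/buf.h>
#include <sys/mount.h>
#include <sys/ucred.h>
#include <sys/vnode.h>

/*
 * Illustration only: a file system read path handing sequential reads
 * to the clustering code and falling back to bread() otherwise.
 */
static int
example_get_block(struct vnode *vp, u_quad_t filesize, daddr_t lbn,
    long bsize, long resid, int seqcount, struct buf **bpp)
{

	if ((vp->v_mount->mnt_flag & MNT_NOCLUSTERR) == 0 && seqcount > 1) {
		/*
		 * Sequential access with clustering enabled: let
		 * cluster_read() coalesce bsize-sized buffers with
		 * contiguous disk addresses into one large physical i/o
		 * (bounded by mnt_iosize_max) and schedule read-ahead.
		 */
		return (cluster_read(vp, filesize, lbn, bsize, NOCRED,
		    resid, seqcount, bpp));
	}
	/* Random access (or clustering disabled): just read this block. */
	return (bread(vp, lbn, bsize, NOCRED, bpp));
}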
Bruce