Date:        Fri, 8 Mar 2013 19:03:52 -0800 (PST)
From:        Don Lewis <truckman@FreeBSD.org>
To:          lev@FreeBSD.org
Cc:          freebsd-fs@FreeBSD.org, ivoras@FreeBSD.org, freebsd-geom@FreeBSD.org
Subject:     Re: Unexpected SU+J inconsistency AGAIN -- please, don't shift topic to ZFS!
Message-ID:  <201303090303.r2933qqJ032330@gw.catspoiler.org>
In-Reply-To: <1402477662.20130306165337@serebryakov.spb.ru>
On 6 Mar, Lev Serebryakov wrote:
>  Hello, Don.
>  You wrote on 6 March 2013, at 14:01:08:
>
>>> DL> With NCQ or TCQ, the drive can have a sizeable number of writes
>>> DL> internally queued and it is free to reorder them as it pleases even with
>>> DL> write caching disabled, but if write caching is disabled it has to delay
>>> DL> the notification of their completion until the data is on the platters
>>> DL> so that UFS+SU can enforce the proper dependency ordering.
>>>  But, again, performance would be terrible :( I've checked it. On
>>> very sparse multi-threaded patterns (multiple torrent downloads on a
>>> fast channel in my simple home case, and, I think, things could be
>>> worse in the case of a big file server in an organization) and
>>> "simple" SATA drives it is significantly worse in my experience :(
>
> DL> I'm surprised that a typical drive would have enough onboard cache for
> DL> write caching to help significantly in that situation.  Is the torrent
>  It is 5x64MiB in my case, oh, effectively, 4x64MiB :)
>  Really, I could repeat the experiment with some predictable and
> repeatable benchmark. What in our ports could be used for a
> massively-parallel (16+ files) random (with blocks like 64KiB and
> file sizes like 2+GiB) but "repeatable" benchmark?

I don't happen to know of any benchmark software in ports for this, but
I haven't really looked.  (A sketch of the sort of thing I would hack
together instead is further down in this message.)

> DL> software doing a lot of fsync() calls?  Those would essentially turn
>  Nope. It tries to avoid fsync(), of course.
>
> DL> Creating a file by writing it in random order is fairly expensive.  Each
> DL> time a new block is written by the application, UFS+SU has to first find
> DL> a free block by searching the block bitmaps, mark that block as
> DL> allocated, wait for that write of the bitmap block to complete, write
> DL> the data to that block, wait for that to complete, and then write the
> DL> block pointer to the inode or an indirect block.  Because of the random
> DL> write ordering, there is probably not enough locality to coalesce
> DL> multiple updates to the bitmap and indirect blocks into one write before
> DL> the syncer interval expires.  These operations all happen in the
> DL> background after the write() call, but once you hit the I/O per second
> DL> limit of the drive, eventually enough backlog builds to stall the
> DL> application.  Also, if another update needs to be done to a block that
> DL> the syncer has queued for writing, that may also cause a stall until the
> DL> write completes.  If you hack the torrent software to create and
> DL> pre-zero each file before it starts downloading it, then each bitmap and
> DL> indirect block will probably only get written once during that operation
> DL> and won't get written again during the actual download, and zeroing the
> DL> data blocks will be sequential and fast.  During the download, the only
> DL> writes will be to the data blocks, so you might see something like a 3x
> DL> performance improvement.
>  My client (transmission, from ports) is configured to do "real
> preallocation" (not a sparse one), but it doesn't help much. It is
> surely limited by disk I/O :(
>  But anyway, a torrent client is a bad benchmark if we start to speak
> about real experiments to decide what could be improved in the
> FFS/GEOM stack, as it is not very repeatable.
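Coming back to your question about a repeatable benchmark: if nothing
in ports fits, a small throwaway test program would at least make runs
comparable.  This is only a rough, untested sketch (the file names,
file count, sizes, block size, and seeds are arbitrary choices of
mine): 16 worker threads each preallocate a 2 GiB file and then issue
64 KiB pwrite()s at seeded pseudo-random offsets, so two runs touch the
same offsets in the same order.

/*
 * Rough sketch of a repeatable massively-parallel random-write test.
 * All sizes and names below are arbitrary, not from any ports package.
 */
#include <sys/types.h>

#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NFILES   16                     /* parallel files/workers */
#define BLKSIZE  (64 * 1024)            /* 64 KiB writes */
#define FILESIZE (2ULL << 30)           /* 2 GiB per file */
#define NBLOCKS  (FILESIZE / BLKSIZE)
#define NWRITES  8192                   /* random writes per worker */

static void *
worker(void *arg)
{
        int id = (int)(intptr_t)arg;
        unsigned seed = 1000 + id;      /* fixed seed => repeatable offsets */
        char path[32], *buf;
        int fd, error, i;

        snprintf(path, sizeof(path), "bench.%d", id);
        fd = open(path, O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
                perror("open");
                return (NULL);
        }

        /*
         * Allocate real blocks up front (ftruncate() alone would leave the
         * file sparse), to mimic the "real preallocation" case above.
         */
        if ((error = posix_fallocate(fd, 0, FILESIZE)) != 0)
                fprintf(stderr, "posix_fallocate: %s\n", strerror(error));

        buf = malloc(BLKSIZE);
        memset(buf, 'x', BLKSIZE);

        for (i = 0; i < NWRITES; i++) {
                /* rand_r() keeps the per-worker sequence thread-safe. */
                off_t off = (off_t)(rand_r(&seed) % NBLOCKS) * BLKSIZE;

                if (pwrite(fd, buf, BLKSIZE, off) != BLKSIZE)
                        perror("pwrite");
        }

        free(buf);
        close(fd);
        return (NULL);
}

int
main(void)
{
        pthread_t tid[NFILES];
        int i;

        for (i = 0; i < NFILES; i++)
                pthread_create(&tid[i], NULL, worker, (void *)(intptr_t)i);
        for (i = 0; i < NFILES; i++)
                pthread_join(tid[i], NULL);
        return (0);
}

Build it with something like "cc -o randbench randbench.c -lpthread"
(the name is made up) and time the random-write pass with drive write
caching on and off, or with different tunables, to get numbers that are
at least comparable between runs.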
I seem to recall you mentioning that the raid5 geom layer does a lot of
caching, presumably to coalesce writes.  If that causes the responses
to writes to be delayed too much, then the geom layer could end up
starved for writes because the vfs.hirunningspace limit will be
reached.  If this happens, you'll see threads waiting on wdrain.  You
could also monitor vfs.runningbufspace to see how close it is getting
to the limit (a quick poller for that is at the end of this message).
If this is the problem, you might want to try cranking up
vfs.hirunningspace to see if it helps.  One thing that doesn't seem to
fit this theory is that if the raid5 layer is doing a lot of caching to
coalesce writes, then I wouldn't expect the extra write completion
latency caused by turning off write caching in the drives to make much
of a difference.

Another possibility is that you might be running into the 32-command
NCQ limit when write caching is off.  With write caching on, you can
probably shove a lot more write commands into the drive before being
blocked.  That might let the drive achieve somewhat higher IOPS, but I
wouldn't expect a big difference.  It could also be that when you hit
the limit, you end up blocking read commands from getting sent to the
drives, which causes whatever depends on that data to stall.  The gstat
command lets you monitor the queue length and the number of reads and
writes, but I don't know of a way to monitor the number of read and
write commands that the drive has in its internal queue.

Something else to look at is what problems the delayed write completion
notifications from the drives might cause in the raid5 layer itself.
Could they be preventing the raid5 layer from sending other I/O
commands to the drives?  Between the time a write command has been sent
to a drive and the time the drive reports its completion, what happens
if something wants to touch that buffer?

What size writes does the application typically do?  What is the UFS
block size?  What is the raid5 stripe size?  With this access pattern,
you may get poor results if the stripe size is much greater than the
block and write sizes.
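Here is the vfs.runningbufspace poller I mentioned above, again just an
untested sketch (the one-second interval is an arbitrary choice; the
returned value size is checked because the width of these sysctls may
vary between releases):

/*
 * Poll vfs.runningbufspace against vfs.hirunningspace with
 * sysctlbyname(3) once a second.
 */
#include <sys/types.h>
#include <sys/sysctl.h>

#include <stdio.h>
#include <unistd.h>

static long long
fetch(const char *name)
{
        union { int i; long l; long long ll; } u = { 0 };
        size_t len = sizeof(u);

        if (sysctlbyname(name, &u, &len, NULL, 0) != 0) {
                perror(name);
                return (-1);
        }
        /* Widen based on the size the kernel actually returned. */
        if (len == sizeof(int))
                return (u.i);
        if (len == sizeof(long))
                return (u.l);
        return (u.ll);
}

int
main(void)
{
        for (;;) {
                printf("runningbufspace %lld / hirunningspace %lld\n",
                    fetch("vfs.runningbufspace"),
                    fetch("vfs.hirunningspace"));
                sleep(1);
        }
}

Watching its output next to gstat while the torrents are running should
show whether runningbufspace is bumping against the limit when the
stalls happen.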