Date: Mon, 10 Jun 2013 01:10:00 GMT
From: Bruce Evans <brde@optusnet.com.au>
To: freebsd-bugs@FreeBSD.org
Subject: Re: kern/178997: Heavy disk I/O may hang system
Message-ID: <201306100110.r5A1A0FM076378@freefall.freebsd.org>
The following reply was made to PR kern/178997; it has been noted by GNATS.
From: Bruce Evans <brde@optusnet.com.au>
To: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
Cc: freebsd-gnats-submit@FreeBSD.org
Subject: Re: kern/178997: Heavy disk I/O may hang system
Date: Mon, 10 Jun 2013 11:00:29 +1000 (EST)
On Mon, 10 Jun 2013, Klaus Weber wrote:
> On Tue, Jun 04, 2013 at 07:09:59AM +1000, Bruce Evans wrote:
>> On Fri, 31 May 2013, Klaus Weber wrote:
This thread is getting very long, and I will only summarize a couple
of things that I found last week here. Maybe more later.
o Everything seems to be working as well as intended (not very well)
except in bufdaemon and friends. Perhaps it is already fixed there.
I forgot to check which version of FreeBSD you are using. You may
be missing some important fixes. There were some by kib@ a few
months ago, and some by jeff@ after this thread started. I don't
run any version of FreeBSD new enough to have these, and the version
that I run also doesn't seem to have any serious bugs in bufdaemon.
It just works mediocrely.
o Writing in blocks of size less than the fs block size, as bonnie
normally does, gives much the same rewriting effect as bonnie does
explicitly, because the system is forced to read each block before
doing a partial write to it. This at best doubles the amount of
i/o and halves the throughput of the writes.
o There are some minor bugs in the read-before-write system code, and
some differences in the "much the same" that may be important in
some cases:
- when bonnie does the read-before-write, the system normally uses
cluster_read() on the read descriptor and thus uses the system's
idea of sequentiality on the read descriptor. For both ffs and
msdosfs, this normally results in reading a small cluster (up to
the end of the current read) followed by async read-ahead of a
larger cluster or 2. The separate clusters improve latency but
reduce performance. scottl@ recently committed a sysctl
vfs.read_min for avoiding the earlier splitting. Using it made
some interesting but ultimately unimportant differences to the
2-bonnie problem.
- when the system does the read-before-write, ffs normally uses
cluster_read() on the write descriptor and thus uses the system's
idea of sequentiality on the write descriptor. ffs doesn't know
the correct amount to read in this case, and it always asks
for MAXBSIZE, which is both too small and too large. This value
is the amount that should be read synchronously. It is too large
since the normal amount is the application's block size which is
normally smaller, and it is too small since MAXBSIZE is only
half of the max cluster size. The correct tradeoff of latency
vs throughput is even less clear than for a user read, and further
off from being dynamic. msdosfs doesn't even use cluster_read()
for this (my bad). It uses plain bread(). This gives very low
performance when the block size is small. So msdosfs worked much
better in the 2-bonnie benchmark than for rewrites generated by
dd just writing with a small block size and conv=notrunc. After
fixing this, msdosfs worked slightly better than ffs in all cases.
- whoever does the read-before-write, cluster reading tends to generate
a bad i/o pattern. I saw patterns like the following (on ~5.2 where
the max cluster size is only 64K, after arranging to mostly use this
size):
file1: read 64K offset 0
file1: read ahead 64K offset 64K
file2: read 64K offset 0
file1: read ahead 64K offset 64K
file1: write 64K offset 0
file1: read 64K offset 128K
file1: read ahead 64K offset 192K
file2: write 64K offset 0
The 2 files make the disk seek a lot, and the read-and-read-ahead
gives even more seeks to get back to the write position. My drives
are old and have only about 2MB of cache. Seeks with patterns like
the above are apparently just large enough to break the drives'
caching. OTOH, if I use your trick of mounting with -noclusterw,
the seeks are reduced significantly and my throughput increases by
almost a factor of 2, even though this gives writes of only 16K.
Apparently the seeks are reduced just enough for the drives' caches
to work well. I think the same happens for you. Your i/o system
is better, but it only takes a couple of bonnies and perhaps the
read pointers getting even further ahead of the write pointers
to defeat the drive's caching. Small timing differences probably
allow the difference to build up. Mounting with -noclusterw also
gives some synchronization that will prevent this buildup.
- when the system does the read-before-write, the sequential heuristic
isn't necessarily clobbered, but it turns out that the clobbering
gives the best possible behaviour, except for limitations and bugs
in bufdaemon!...
> [... good stuff clipped]
> So it really seems that clustering does provide performance benefits,
> but the RAID controller seems to be able to make up for the lack
> of clustering (either because clustering is disabled, or because it
> does not work effectively due to interspersed reads and seeks on the
> same file descriptor).
Yes, the seek pattern caused by async-but-not-long-delayed writes
(whether done by cluster_write() a bit later or bawrite() directly)
combined with reading far ahead (whether done explicitly or implicitly)
is very bad even for 1 file, but can often be compensated for by caching
in the drives. With 2 files or random writes on 1 file it is much worse,
but apparently mounting with -noclusterw limits it enough for the
drives to compensate in the case of 2 bonnies. I think the best we
can do in general is delay writes as long as possible and then
schedule them perfectly. But scheduling them perfectly is difficult
and only happens accidentally.
>>> I am now looking at vfs_cluster.c to see whether I can find which part
>>> is responsible for letting numdirtybuffers raise without bounds and
>>> why only *re* writing a file causes problems, not the initial
>>> writing. Any suggestions on where to start looking are very welcome.
>>
>> It is very complicated, but it was easy to find its comments saying that
>> it tries not to force out the writes for non-sequential accesses. I
>> am currently trying the following workarounds:
>
> I have decided to start testing with only a single change from the
> list of changes you provided:
>
>> % diff -u2 vfs_cluster.c~ vfs_cluster.c
>> % @@ -726,8 +890,13 @@
>> % * are operating sequentially, otherwise let the buf or
>> % * update daemon handle it.
>> % + *
>> % + * Algorithm changeback: ignore seqcount here, at least for
>> % + * now, to work around readers breaking it for writers. It
>> % + * is too late to start ignoring after write pressure builds
>> % + * up, since not writing out here is the main source of the
>> % + * buildup.
>> % */
>> % bdwrite(bp);
>> % -	if (seqcount > 1)
>> % -		cluster_wbuild_wb(vp, lblocksize, vp->v_cstart, vp->v_clen + 1);
>> % +	cluster_wbuild_wb(vp, lblocksize, vp->v_cstart, vp->v_clen + 1);
>> % 	vp->v_clen = 0;
>> % 	vp->v_cstart = lbn + 1;
>
> And sure enough, a kernel with only this one line change[*] is able to
> handle reading and re-writing a file through a single file descriptor
> just fine, good performance, no hangs, vfs.numdirtybuffers remains
> low:
After more testing, I found that this was almost perfectly backwards for
my hardware! I think for your hardware it allows the drives to
compensate, much like with -noclusterw but with a slightly improved
throughput due to the larger writes. But with my drives, it mostly
just gives more seeks. After changing this back and being more careful
with the comparisons, I found that best results are obtained (in ~5.2)
by letting numdirtybuffers build up. The breakage of the sequential
heuristic causes the above to never force out the cluster immediately
for the 2-bonnie case. I get similar behaviour by always using delayed
writes in ffs_write(). This might depend on setting B_CLUSTEROK in more
cases, so that the clustering always gets done later.
Typical throughputs for me:
- my drives can do 55MB/sec max and get 48 for writing 1 file with large
blocks using dd
- 48 drops to half of 20-24 with read-before-write for 1 file. That's
a 4-fold reduction. One half is for the doubled i/o and the other half
is for the seeks.
- half of 20-24 drops to half of 10-12 with 2 files and read-before-write
of each, in the best case. That's an 8-fold reduction. Another factor
of 2 is apparently lost to more seeks.
- half of 10-12 drops to half of 5-6, as in the previous point but in the
worst case. That's a 16-fold reduction. The worst case is with my
modification above. It maximizes the seeks. My original idea for a
fix (in the above diff) gave this case. It gave almost perfect
clustering and almost no buildup of numdirtybuffers, but throughput
was still the worst. (My drives can do 16K blocks at full speed provided
the blocks are contiguous, so they don't benefit much from clustering
except for its side effect of reducing seeks to other blocks in between
accessing the contiguous ones.)
Some of this typical behaviour is not very dependent on block sizes. The
drives become seek-bound, and anything that doubles the number of seeks
halves the throughput.
Bruce
