Date: Mon, 27 May 2013 15:57:56 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Klaus Weber
Cc: freebsd-bugs@FreeBSD.org, freebsd-gnats-submit@FreeBSD.org
Subject: Re: kern/178997: Heavy disk I/O may hang system
Message-ID: <20130527135103.H919@besplex.bde.org>
In-Reply-To: <201305261951.r4QJpn9Z071712@oldred.FreeBSD.org>

On Sun, 26 May 2013, Klaus Weber wrote:

>> Environment:
> FreeBSD filepile 9.1-STABLE FreeBSD 9.1-STABLE #5 r250475: Sun May 12 19:14:21 CEST 2013 root@filepile:/usr/obj/usr/src/sys/FILEPILE amd64
> (Kernel config has some drivers removed and debugging options added. Tested with GENERIC as well)

>> Description:
> Heavy disk I/O (two bonnie++ processes working on the same disk simultaneously) causes an extreme degradation in disk throughput (combined throughput as observed in iostat drops to ~1-3 MB/sec). The problem shows when both bonnie++ processes are in the "Rewriting..." phase.

Please use the unix newline character in mail.

A huge drop in disk throughput is normal with multiple processes, unfortunately. Perhaps not quite this much.

> Sometimes, the system "hangs" (throughput reaches 0 MB/sec); in that case, the system does not recover from this state. "Hang" means that the system responds to ping etc., and processes that are already running will continue, but every command that requires disk access (even on disks other than the disk being tested) just hangs. If shutdown can be started without disk access (i.e., it is still in cache), the system will begin the shutdown, but fails to sync all buffers.

Hangs are bugs, though. I have been arguing with kib@ that some methods of handling heavy disk i/o are nonsense, since they either make the problem worse (by switching to direct unclustered writes, so that slow i/o goes even slower) or have no effect except to complicate the analysis of the problem (because they are old hackish methods, and newer, better methods make the cases handled by the old methods unreachable). But the analysis is complicated, and we couldn't agree on anything.
One thing I noticed more recently than my last mail to kib about this is that even a single heavy disk write (say, dd bs=256m count=2 foo) can block another non-write-bound process (say, iostat 1 or systat -v 1 to watch the progress) from starting up, sometimes even when it has just been run to load it into the cache(s), until the dd completes 10 seconds later. This was on a 2-core system running a fairly old version of FreeBSD (8.0-CURRENT from ~2008).

I found that the problem could be fixed by killing cluster_write() by turning it into bdwrite() (by editing the running kernel using ddb, since this is easier than rebuilding the kernel). I was trying many similar things since I had a theory that cluster_write() is useless. Actually, cluster_write() gives some minor optimizations, but this misbehaviour indicates that it has a resource starvation bug. My theory about the bug is that cluster_write() and cluster_read() share the limited resource of pbufs. pbufs are not managed as carefully as normal buffers. In particular, there is nothing to limit write pressure from pbufs like there is for normal buffers.

However, this behaviour doesn't happen on the much older version of FreeBSD that I normally use (5.2-CURRENT from 2004 with many local modifications). It starts iostat to watch dd instantly, even running on a single-core system.

> gpart create -s gpt da0
> gpart add -s 500G -t freebsd-ufs -l test0 -a 1M da0
> newfs -b 64k -f 8k /dev/da0p1
> mount /dev/da0p1 /mnt && mkdir /mnt/t{1,2} && chmod g+w /mnt/t{1,2}

The default for newfs is -b 32k. This asks for buffer cache fragmentation. Someone increased the default from 16k to 32k without changing the buffer cache's preferred size (BKVASIZE = 16K). BKVASIZE has always been too small, but on 32-bit arches kernel virtual memory is too limited to have a larger BKVASIZE by default. BKVASIZE is still 16K on all arches, although this problem doesn't affect 64-bit arches. -b 64k is worse.

> When both bonnie++ processes are in their "Rewriting" phase, the system hangs within a few seconds. Both bonnie++ processes are in state "nbufkv". bufdaemon takes about 40% CPU time and is in state "qsleep" when not active.

You got the buffer cache fragmentation that you asked for. Buffer cache fragmentation caused serious problems 10-20 years ago. The ones causing hangs were supposed to be fixed about 12 years ago, and the ones causing extreme slowness for just the software parts of handling fragmentation were fixed a little later, but apparently some remain. However, I have never been able to reproduce serious fragmentation problems from using too-large block sizes, or to demonstrate significant improvements from avoiding the known fragmentation problem by increasing BKVASIZE. Perhaps my systems are too small, or have tuning or local changes that accidentally avoid the problem. Apparently you found a way to reproduce the serious fragmentation problems.

Try using a block size that doesn't ask for the problem. Increasing BKVASIZE would take more work than this: although it was intended to be a system parameter which could be changed to reduce the fragmentation problem, one of the many bugs in it is that it was never converted into a "new" kernel option. Another of the bugs in it is that doubling it halves the number of buffers, rather than simply using twice as much kva. This severely limited the number of buffers back when memory sizes were 64MB. It is not a very significant limitation if the memory size is 1GB or larger.
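For example (untested, and assuming the test file system can simply be recreated, with the usual 8:1 block:fragment ratio), a newfs that matches the buffer cache's preferred size avoids asking for the fragmentation:

    # 16k blocks match BKVASIZE, so each buffer maps exactly one block
    newfs -b 16k -f 2k /dev/da0p1
    # or the current 32k default, which is less bad than 64k
    newfs -b 32k -f 4k /dev/da0p1

Raising BKVASIZE instead would mean editing its definition in sys/sys/param.h and rebuilding the kernel, since, as above, it was never made a proper kernel option.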
> I have also briefly tested a single disk instead of a RAID volume. In that case, I could not get the system to hang, but performance drops from ~150 MB/sec to ~1-3 MB/sec. I do not know if that indicates a problem with the arcmsr driver, or whether I cannot create "enough" I/O load with a single disk:

I get ~39MB/sec with 1 "dd bs=128m count=2 /dev/zero >foo" writing to a nearly full old ffs file system on an old PATA disk, and 2 times 20MB/sec with 2 dd's. This is almost as good as possible. The configuration was somewhat flawed:
- The writes all fitted in memory, because my disks are too full to hold more. This shouldn't be a problem for write benchmarks, because write pressure is (too) limited by writing out dirty buffers (too) early, so writes are mostly physical.
- The fs block size was 16K. This is not a flaw, just the right size, but other sizes should be tested.
- The file system was mounted async. This should have little effect, since for large writes all data is written async anyway (the write pressure handling causes async writes instead of delayed writes even for not-so-large writes).
- ffs was modified to remove its bogus write pressure handling. This should have no effect, but I didn't verify that.

> I have, however, logged the sysctl vfs subtree every second while the problem occurs:
> As long as both bonnie++ are in the "Writing intelligently..." phase, vfs.numdirtybuffers stays mostly between 80-100. When the first bonnie++ goes to "Rewriting..." (at 3:40 below), it starts to increase (about 150-350 new dirty buffers per second). When the second bonnie++ goes to "Rewriting..." as well (at 4:24), this speeds up a lot (about 3000-5000 new dirty buffers/sec). A few seconds later, vfs.numdirtybuffers reaches its limit, and remains there until the machine is rebooted:
>
> 00-03-28.log:vfs.numdirtybuffers: 107
> 00-03-29.log:vfs.numdirtybuffers: 106
> 00-03-30.log:vfs.numdirtybuffers: 105
> ...
> [first bonnie++ starts Rewriting]
> 00-03-41.log:vfs.numdirtybuffers: 294
> 00-03-42.log:vfs.numdirtybuffers: 404
> 00-03-43.log:vfs.numdirtybuffers: 959
> 00-03-44.log:vfs.numdirtybuffers: 1071

There are various write pressure handling mechanisms that have different thresholds and get in each other's way a bit. The dirty buffer watermarks are presumably much higher than 105 or even 1071. Other mechanisms apparently work well and limit the number of dirty buffers to only 107 with 1 writer. Writes to memory can go much faster than writes to the disk even when the latter go at full speed, but some mechanism blocks writers (too) early, so the 1 writer soon reaches a steady state with not many dirty buffers.

> 00-03-45.log:vfs.numdirtybuffers: 1155
> 00-03-46.log:vfs.numdirtybuffers: 1253
> 00-03-47.log:vfs.numdirtybuffers: 1366
> 00-03-48.log:vfs.numdirtybuffers: 1457
> ...
> 00-04-21.log:vfs.numdirtybuffers: 8156
> 00-04-22.log:vfs.numdirtybuffers: 8362
> 00-04-23.log:vfs.numdirtybuffers: 8558
> [second bonnie goes Rewriting as well]
> 00-04-24.log:vfs.numdirtybuffers: 11586
> 00-04-25.log:vfs.numdirtybuffers: 16325
> 00-04-26.log:vfs.numdirtybuffers: 24333
> ...
> 00-04-54.log:vfs.numdirtybuffers: 52096
> 00-04-57.log:vfs.numdirtybuffers: 52098
> 00-05-00.log:vfs.numdirtybuffers: 52096
> [ etc. ]

This is a rather large buildup and may indicate a problem. Try reducing the dirty buffer watermarks. Their default values are mostly historical nonsense: they are just some fixed fraction (about 1/4) of the number of buffers, and with multiple GB of memory, the number of buffers may be huge.
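A sketch of what that could look like (untested; the exact values below are only a guess at a limit worth a few seconds of physical writes, rather than ~1/4 of nbuf):

    sysctl vfs.hidirtybuffers vfs.lodirtybuffers vfs.dirtybufthresh
    # with 64k buffers, 4000 dirty buffers is ~250MB of write backlog
    sysctl vfs.hidirtybuffers=4000
    sysctl vfs.dirtybufthresh=3600
    sysctl vfs.lodirtybuffers=2000

Keep the same ordering as the defaults (lodirtybuffers < dirtybufthresh < hidirtybuffers).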
On a system with 24GB, vfs.nbuf is 157800, and vfs.dirtybufthresh is 35523. But the size of main memory has nothing to do with the correct limit on the number of dirty buffers (unless the latter is not being used to limit write pressure, but only to limit starvation of buffers for readers; this is perhaps all it is supposed to do, but see below).

We want buildups of dirty buffers to be limited so that the system never gets too far behind on physical writes. The correct limit depends on the physical write capabilities of the machine and on which write channels have pressure on them. Default global limits don't work at all for this. You can fix the defaults, but you can't distribute the limits properly across channels, or easily keep the limits up to date as channels are added. It is easy to add slow USB channels. These won't increase the write capability much, but it is very easy for a slow disk, or the channel to it, to grab many dirty buffers and keep them for a long time, so that faster disks are starved.

Other mechanisms handle write pressure better than the simple dirty buffer count can (they count the amount of i/o, which is more relevant to how long it will take). So assume that any effect of the dirty buffer count mechanism on write pressure is unintentional. There are still the following problems with it:
- As above, slow channels can starve fast channels, thus giving a severe side effect on the write pressure (at best, the users of the fast channels block so that they are slowed down to the speed of the slow channels).
- Buffer starvation for readers can happen anyway.

vfs.nbuf is the number of buffers that are allocated, but not all of these can be used when some file systems use buffer sizes larger than about BKVASIZE. The worst case is when all file systems use a block size of 64KB. Then only about 1/4 of the allocated buffers can be used. The number that can be used is limited by vfs.maxbufspace and related variables. The default for vfs.maxbufspace is controlled indirectly by BKVASIZE (e.g., maxbufspace = 2GB goes with nbuf = 128K; the ratio is 16K = BKVASIZE). So when you ask for buffer cache fragmentation with -b 64k, you get even more: the effective nbuf is reduced by a factor of 4. This means that all the dirty buffer count watermarks except one are slightly above the effective nbuf, so the higher watermarks cannot quite be reached, and when they are nearly reached, readers are starved of buffers.

> vfs.hidirtybuffers: 52139
> vfs.lodirtybuffers: 26069
> (the machine has 32 GB RAM)

Fairly consistent with my example of a machine with 24GB RAM. The upper watermark is 33% higher, etc. Your nbuf will be more like 200K than 150K. You reached numdirtybuffers = 52096. At 64K each, that's about 3GB. maxbufspace will be about the same, and bufspace (the amount used) will be not much smaller. There will be little space for buffers for readers, and the 2 writers apparently manage to starve even each other, perhaps for similar reasons. The backlog of 3GB would take 20 seconds to clear even at 150MB/sec, and at 1MB/sec it would take almost an hour, so it is not much better than a hang.

> I have found http://freebsd.1045724.n5.nabble.com/7-1-RELEASE-I-O-hang-td3966945.html , which seems to describe the same issue (also in combination with a RAID controller, albeit a different one).

I think the hardware is usually not the problem (at least after kib's fix for the above), except that when it is large it can stress the software more.

Bruce