Subject: Re: ZFS "stalls" -- and maybe we should be talking about defaults?
From: Dennis Glatting <dg@pki2.com>
To: Karl Denninger
Cc: freebsd-stable@freebsd.org
Date: Mon, 04 Mar 2013 19:52:35 -0800
Message-ID: <1362455555.62624.11.camel@btw.pki2.com>
In-Reply-To: <51355F64.4040409@denninger.net>
References: <513524B2.6020600@denninger.net> <1362449266.92708.8.camel@btw.pki2.com> <51355F64.4040409@denninger.net>

On Mon, 2013-03-04 at 20:58 -0600, Karl Denninger wrote:
> Stick this in /boot/loader.conf and see if your lockups go away:
>
> vfs.zfs.write_limit_override=1024000000
>
> K.
>
> I've got a "sentinel" running that watches for zero-bandwidth
> "zpool iostat 5" output. It has been running for close to 12 hours
> now, and with the two tunables I changed the stalls don't appear to
> be happening any more.

I've also done this, as well as running top and systat -vmstat. Disk
I/O stops, but the system stays alive through top, systat, and the
network. However, if I try to log in, the login won't complete.

All of my systems use hardware RAID1 for the OS (LSI and Areca) and
typically a separate disk for swap. All other disks are ZFS.

> This system always has small-ball write I/Os going to it as it's a
> postgresql "hot standby" mirror backing a VERY active system and is
> receiving streaming log data from the primary at a colocation site,
> so the odds of it ever experiencing an actual zero for I/O (unless
> there's a connectivity problem) are pretty remote.

I am doing multi-TB sorts and GB-scale database loads.

> If it turns out that the write_limit_override tunable is the one
> responsible for stopping the hangs I can drop the ARC limit tunable,
> although I'm not sure I want to; I don't see much if any performance
> penalty from leaving it where it is, and if the larger cache isn't
> helping anything then why use it? I'm inclined to stick an SSD in
> the cabinet as a cache drive instead of dedicating RAM to this --
> even though it's not AS fast as RAM it's still MASSIVELY quicker
> than getting data off a rotating plate of rust.

I forgot to mention that my three 8.3 systems occasionally offline a
disk (one or two a week, total). I simply online the disk, and after
the resilver all is well. There are ~40 disks across those three
systems.
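Neither of us has posted the actual watcher, but for anyone who wants
to run the same kind of check, a minimal sketch of the idea is below.
The pool name and log path are placeholders, not anything Karl posted;
also note the first sample zpool iostat prints is the average since
boot, so a hit on the very first line can be ignored:

  #!/bin/sh
  # Watch "zpool iostat POOL 5" and log every interval in which both
  # read and write bandwidth are reported as zero.
  POOL=tank
  zpool iostat ${POOL} 5 | while read name alloc free rops wops rbw wbw; do
      # Skip header and separator rows; only parse the pool's own lines.
      [ "${name}" = "${POOL}" ] || continue
      if [ "${rbw}" = "0" ] && [ "${wbw}" = "0" ]; then
          echo "$(date): ${POOL} showed zero I/O bandwidth" \
              >> /var/log/zfs-sentinel.log
      fi
  done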
Of my 9.1 systems, three are busy but have a smaller number of disks
(about eight across two volumes, one RAIDz2 and one mirror). I also
have a ZFS-on-Linux (CentOS) system for play (about 12 disks). It did
not exhibit problems when it was in use, but it did teach me a lesson
on the evils of dedup. :)

> Am I correct that a ZFS filesystem does NOT use the VM buffer cache
> at all?

Dunno.

> On 3/4/2013 8:07 PM, Dennis Glatting wrote:
> > I get stalls with 256GB of RAM and arc_max=64G (my limit is
> > usually 25%) on a 64-core system with 20 new 3TB Seagate disks
> > under LSI2008 chips, without much load. Interestingly, pbzip2
> > consistently created a problem on a volume whereas gzip did not.
> >
> > Here, stalls happen across several systems, though I have had
> > fewer problems under 8.3 than 9.1. If I go to hardware RAID5
> > (LSI2008 -- same chips: IR vs IT firmware) I don't have a problem.
> >
> > On Mon, 2013-03-04 at 16:48 -0600, Karl Denninger wrote:
> >> Well now this is interesting.
> >>
> >> I have converted a significant number of filesystems to ZFS over
> >> the last week or so and have noted a few things. A couple of them
> >> aren't so good.
> >>
> >> The machine in question has 12GB of RAM and dual Xeon 5500-series
> >> processors. It also has an ARECA 1680ix with 2GB of local cache
> >> and its BBU. The ZFS spindles are all exported as JBOD drives. I
> >> set the four disks up under GPT, added a single freebsd-zfs
> >> partition to each, labeled them, and the providers were then
> >> geli-encrypted and added to the pool. When the same disks were
> >> running UFS filesystems they were set up as a 0+1 RAID array under
> >> the ARECA adapter, exported as a single unit, GPT-labeled as a
> >> single pack, and then gpart-sliced and newfs'd under UFS+SU.
> >>
> >> Since I previously ran UFS filesystems on this configuration I
> >> know what performance level I achieved with it, and the entire
> >> system had been running flawlessly set up that way for the last
> >> couple of years. Presently the machine is running 9.1-STABLE,
> >> r244942M.
> >>
> >> Immediately after the conversion I set up a second pool to play
> >> with backup strategies to a single drive, and ran into a problem.
> >> The disk I used for that testing is one that previously was in
> >> the rotation and is also known good. I began to get EXTENDED
> >> stalls with zero I/O going on, some lasting for 30 seconds or so.
> >> The system was not frozen, but anything that touched I/O would
> >> lock until it cleared. Dedup is off, incidentally.
> >>
> >> My first thought was that I had a bad drive, cable, or other
> >> physical problem. However, searching for that proved fruitless --
> >> nothing was being logged anywhere: not in the SMART data, not by
> >> the adapter, not by the OS. Nothing. Sticking a digital storage
> >> scope on the +5V and +12V rails didn't disclose anything
> >> interesting either; the power in the chassis is stable. Further,
> >> swapping the only disk that had changed (the new backup volume)
> >> for a different one didn't change the behavior.
> >>
> >> The last straw was when I was able to reproduce the stalls WITHIN
> >> the original pool, against the same four disks that had been
> >> running flawlessly for two years under UFS, and still couldn't
> >> find any evidence of a hardware problem (not even ECC-corrected
> >> data returns). All the disks involved are completely clean --
> >> zero sector reassignments, the drive-specific logs are clean, etc.
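(As an aside, for anyone wanting to reproduce the per-disk layout Karl
describes above: he didn't post his actual commands, but it would be
something along these lines. The device names, labels, geli options,
and the mirrored vdev layout are my guesses, shown only to illustrate
the GPT-label-geli-pool stacking:

  gpart create -s gpt da1
  gpart add -t freebsd-zfs -l disk1 da1
  geli init -s 4096 /dev/gpt/disk1    # prompts for a passphrase
  geli attach /dev/gpt/disk1          # creates /dev/gpt/disk1.eli
  # ...repeat for da2-da4, then build the pool from the .eli providers:
  zpool create pool0 mirror gpt/disk1.eli gpt/disk2.eli \
      mirror gpt/disk3.eli gpt/disk4.eli

The pool is built on the .eli providers, so everything ZFS writes is
encrypted before it reaches the ARECA adapter.)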
> >> Attempting to cut back the ARECA adapter's aggressiveness
> >> (buffering, etc.) on the theory that I was tickling something in
> >> its cache-management algorithm that was pissing it off proved
> >> fruitless as well, even when I shut off ALL caching and NCQ
> >> options. I also set vfs.zfs.prefetch_disable=1, to no effect.
> >> Hmmmm...
> >>
> >> Last night, after reading the ZFS Tuning wiki for FreeBSD, I went
> >> on a lark and limited the ARC cache to 2GB
> >> (vfs.zfs.arc_max=2000000000), set vfs.zfs.write_limit_override to
> >> 1024000000 (1GB), and rebooted.
> >>
> >> The problem instantly disappeared, and I cannot provoke its
> >> return even with multiple full-bore snapshot and rsync filesystem
> >> copies running while a scrub is being done.
> >>
> >> I'm pinging between being I/O- and processor- (geli-) limited now
> >> in normal operation, and slamming the I/O channel during a scrub.
> >> Performance appears to be roughly equivalent to UFS+SU, maybe a
> >> bit less, but it's fairly close.
> >>
> >> The operating theory I have at the moment is that the ARC cache
> >> was in some way getting into a near-deadlock with other memory
> >> demands on the system (there IS a Postgres server running on this
> >> hardware, although it's a replication server and not taking
> >> queries -- nonetheless it does grab a chunk of RAM), leading to
> >> the stalls. Limiting its grab of RAM appears to have resolved the
> >> contention. I was unable to catch the system actually running out
> >> of free memory, although it was consistently down to a low
> >> five-digit free page count, and the kernel never garfed on the
> >> console about resource exhaustion -- other than a bitch about
> >> swap stalling (the infamous "more than 20 seconds" message). Page
> >> space in use near the time in question (I could not get a display
> >> while locked, as the display went to I/O and froze) was not zero,
> >> but pretty close to it (a few thousand blocks). That the system
> >> was driven into light paging does appear significant, and
> >> indicative of some sort of memory contention, as under UFS this
> >> machine was never observed to allocate page space.
> >>
> >> Has anyone seen anything like this before, and if so... is this a
> >> case of bad defaults, or some bad behavior between various kernel
> >> memory-allocation contention sources?
> >>
> >> This isn't exactly a resource-constrained machine -- it's running
> >> x64 code with 12GB of RAM and two quad-core processors!

-- 
Dennis Glatting
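P.S. For anyone skimming the thread, the two loader.conf settings
Karl ended up with were:

  # /boot/loader.conf additions (values from Karl's message above)
  vfs.zfs.arc_max=2000000000               # cap the ARC at ~2GB
  vfs.zfs.write_limit_override=1024000000  # cap per-txg writes at ~1GB

Both take effect at boot; you can verify the live values afterward
with "sysctl vfs.zfs.arc_max vfs.zfs.write_limit_override".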