Date: Mon, 04 Mar 2013 20:48:30 -0600
From: Karl Denninger <karl@denninger.net>
To: freebsd-stable@freebsd.org
Subject: Re: ZFS "stalls" -- and maybe we should be talking about defaults?

On 3/4/2013 6:33 PM, Steven Hartland wrote:
> What does zfs-stats -a show when you're having the stall issue?
>
> You can also use zpool iostat to show individual disk I/O stats, which may help identify a single failing disk, e.g.:
>
>     zpool iostat -v 1
>
> Also, have you investigated which of the two sysctls you changed fixed it, or does it require both?
>
> Regards
> Steve
>
> ----- Original Message ----- From: "Karl Denninger" <karl@denninger.net>
> To: freebsd-stable@freebsd.org
> Sent: Monday, March 04, 2013 10:48 PM
> Subject: ZFS "stalls" -- and maybe we should be talking about defaults?
>
> Well now this is interesting.
>
> I have converted a significant number of filesystems to ZFS over the last week or so and have noted a few things. A couple of them aren't so good.
>
> The subject machine has 12GB of RAM and dual Xeon 5500-series processors. It also has an ARECA 1680ix in it with 2GB of local cache and the BBU for it. The ZFS spindles are all exported as JBOD drives. I set the four disks up under GPT, added a single freebsd-zfs partition to each, labeled them, and the resulting providers were then geli-encrypted and added to the pool. When the same disks were running UFS filesystems they were set up as a RAID 0+1 array under the ARECA adapter, exported as a single unit, GPT-labeled as a single pack, and then gpart-sliced and newfs'd under UFS+SU.
>
> Since I previously ran UFS filesystems on this configuration I know what performance level I achieved with it, and the entire system had been running flawlessly, set up that way, for the last couple of years. The machine is presently running 9.1-STABLE, r244942M.
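As a rough illustration of the per-disk preparation described above, with hypothetical device, label and pool names (da1, disk1, tank) and an assumed two-mirror layout, since the post does not give those details:

    # Hypothetical names throughout; the actual devices, labels, geli options
    # and vdev layout on this machine are not stated in the post.
    gpart create -s gpt da1
    gpart add -t freebsd-zfs -l disk1 da1
    geli init /dev/gpt/disk1          # prompts for a passphrase; key-file options omitted
    geli attach /dev/gpt/disk1
    # ...repeat for the other three disks, then build the pool from the .eli providers:
    zpool create tank mirror gpt/disk1.eli gpt/disk2.eli mirror gpt/disk3.eli gpt/disk4.eli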
> Immediately after the conversion I set up a second pool to play with backup strategies to a single drive, and ran into a problem. The disk I used for that testing is one that previously was in the rotation and is also known good. I began to get EXTENDED stalls with zero I/O going on, some lasting for 30 seconds or so. The system was not frozen, but anything that touched I/O would lock until it cleared. Dedup is off, incidentally.
>
> My first thought was that I had a bad drive, cable or other physical problem. However, searching for that proved fruitless -- there was nothing being logged anywhere: not in the SMART data, not by the adapter, not by the OS. Nothing. Sticking a digital storage scope on the +5V and +12V rails didn't disclose anything interesting either; the power in the chassis is stable. Further, swapping the only disk that had changed (the new backup volume) with a different one didn't change the behavior.
>
> The last straw was when I was able to reproduce the stalls WITHIN the original pool, against the same four disks that had been running flawlessly for two years under UFS, and still couldn't find any evidence of a hardware problem (not even ECC-corrected data returns). All the disks involved are completely clean -- zero sector reassignments, the drive-specific logs are clean, etc.
>
> Attempting to cut back the ARECA adapter's aggressiveness (buffering, etc.) on the theory that I was tickling something in its cache-management algorithm that was pissing it off proved fruitless as well, even when I shut off ALL caching and NCQ options. I also set vfs.zfs.prefetch_disable=1, to no effect. Hmmmm...
>
> Last night, after reading the ZFS Tuning wiki for FreeBSD, I went on a lark: I limited the ARC cache to 2GB (vfs.zfs.arc_max=2000000000), set vfs.zfs.write_limit_override to 1024000000 (1GB), and rebooted. The problem instantly disappeared, and I cannot provoke its return even with multiple full-bore snapshot and rsync filesystem copies running while a scrub is being done.
>
> I'm pinging between being I/O-limited and processor-limited (geli) now in normal operation, and slamming the I/O channel during a scrub. Performance appears to be roughly equivalent to what it was with UFS+SU, maybe a bit less -- but it's fairly close.
>
> The operating theory I have at the moment is that the ARC cache was in some way getting into a near-deadlock with other memory demands on the system (there IS a Postgres server running on this hardware, although it's a replication server and not taking queries -- nonetheless it does grab a chunk of RAM), leading to the stalls. Limiting its grab of RAM appears to have resolved the contention issue. I was unable to catch it actually running out of free memory, although it was consistently into the low five-digit free page count, and the kernel never garfed on the console about resource exhaustion -- other than a bitch about swap stalling (the infamous "more than 20 seconds" message). Page space in use near the time in question (I could not get a display while locked, as it went to I/O and froze) was not zero, but pretty close to it (a few thousand blocks). That the system was driven into light paging does appear to be significant, and indicative of some sort of memory contention, as this machine has never been observed to allocate page space when running UFS filesystems.
>
> Has anyone seen anything like this before, and if so... is this a case of bad defaults, or some bad behavior between various kernel memory-allocation contention sources?
>
> This isn't exactly a resource-constrained machine running x64 code, with 12GB of RAM and two quad-core processors in it!
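For reference, the message describes both of those settings as loader tunables set from /boot/loader.conf; a minimal sketch using the exact values quoted above, with annotations that are an interpretation of what each knob does on 9.x rather than something stated in the thread:

    # /boot/loader.conf
    vfs.zfs.arc_max="2000000000"               # cap the ARC at roughly 2GB
    vfs.zfs.write_limit_override="1024000000"  # force the per-transaction-group write limit to roughly 1GB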
I caught it with systat -vm running (which displays raw I/O stats at the bottom), and when it locks there is no I/O to ANY spindle (there are six online spindles in the box, plus two backup volumes that are normally unmounted).

Note that the machine is not booting from ZFS -- it boots from, and has its swap on, a UFS two-drive mirror (handled by the disk adapter; it looks like a single "da0" drive to the OS), and that drive stalls as well when it freezes. It's definitely a kernel thing when it happens, as the OS would otherwise not have locked up (just I/O to the user partitions) -- but it does. You can't do anything while it's frozen; anything that wants I/O hangs until it unfreezes. I have zero errors logged in the OS pack for both drives in that mirror as well, and again none in the RAID adapter either.

I'm not sure which tunable stopped it, as I changed both at the same time. Unfortunately both of those tunables can only be changed in /boot/loader.conf, not dynamically, so trying to figure out where the wall is on this is going to be a lot of fun.

This is a machine I can futz with provided that I give reasonable notice and it's off-hours. It has a sister system that I can play with at will, up to and including destroying it; I'm going to take one of the backup volumes, detach it, and use it as a "seed" to effectively replicate the environment on the other box and see if I can isolate this.

I've got close to a dozen machines in this basic configuration in the field; they're slightly older Xeon-series CPUs but work exceptionally well. This is my first foray into ZFS, and I need to understand what's going on, as stalls like this in production are not good, for obvious reasons.

-- 
-- Karl Denninger
/The Market Ticker ®/
Cuda Systems LLC
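For the one-tunable-at-a-time bisection described above, a minimal sketch, assuming the same tunable names and the values already quoted in the thread; nothing else here comes from the post:

    # /boot/loader.conf on the test box: enable one limit per reboot
    vfs.zfs.arc_max="2000000000"
    #vfs.zfs.write_limit_override="1024000000"   # commented out for this pass

After rebooting, "sysctl vfs.zfs.arc_max vfs.zfs.write_limit_override" shows the values actually in effect.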