Date: Mon, 04 Mar 2013 18:07:46 -0800
From: Dennis Glatting <freebsd@pki2.com>
To: Karl Denninger <karl@denninger.net>
Cc: freebsd-stable@freebsd.org
Subject: Re: ZFS "stalls" -- and maybe we should be talking about defaults?
Message-ID: <1362449266.92708.8.camel@btw.pki2.com>
In-Reply-To: <513524B2.6020600@denninger.net>
References: <513524B2.6020600@denninger.net>
I get stalls with 256GB of RAM and arc_max=64G (my limit is usually 25%) on
a 64-core system with 20 new 3TB Seagate disks behind LSI2008 chips, without
much load. Interestingly, pbzip2 consistently created a problem on a volume
whereas gzip does not. Here, stalls happen across several systems, though I
have had fewer problems under 8.3 than under 9.1. If I go to hardware RAID5
(LSI2008 -- same chips: IR vs. IT firmware) I don't have a problem.
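For reference, that cap is just the usual boot-time tunable; a minimal
/boot/loader.conf sketch using the 64G figure above (25% of the 256GB in
this box -- adjust the value to whatever fraction of RAM fits your machine):

    # cap the ZFS ARC at boot (size suffixes such as G are accepted)
    vfs.zfs.arc_max="64G"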
On Mon, 2013-03-04 at 16:48 -0600, Karl Denninger wrote:
> Well now this is interesting.
>
> I have converted a significant number of filesystems to ZFS over the last
> week or so and have noted a few things. A couple of them aren't so good.
>
> The subject machine in question has 12GB of RAM and dual Xeon 5500-series
> processors. It also has an ARECA 1680ix in it with 2GB of local cache and
> the BBU for it. The ZFS spindles are all exported as JBOD drives. I set up
> four disks under GPT, added a single freebsd-zfs partition to each, labeled
> them, and the providers are then geli-encrypted and added to the pool. When
> the same disks were running UFS filesystems they were set up as a 0+1 RAID
> array under the ARECA adapter, exported as a single unit, GPT-labeled as a
> single pack and then gpart-sliced and newfs'd under UFS+SU.
>
> Since I previously ran UFS filesystems on this config I know what
> performance level I achieved with that, and the entire system had been
> running flawlessly set up that way for the last couple of years. Presently
> the machine is running 9.1-STABLE, r244942M.
>
> Immediately after the conversion I set up a second pool to play with backup
> strategies to a single drive and ran into a problem. The disk I used for
> that testing is one that previously was in the rotation and is also known
> good. I began to get EXTENDED stalls with zero I/O going on, some lasting
> for 30 seconds or so. The system was not frozen, but anything that touched
> I/O would lock until it cleared. Dedup is off, incidentally.
>
> My first thought was that I had a bad drive, cable or other physical
> problem. However, searching for that proved fruitless -- there was nothing
> being logged anywhere -- not in the SMART data, not by the adapter, not by
> the OS. Nothing. Sticking a digital storage scope on the +5V and +12V rails
> didn't disclose anything interesting with the power in the chassis; it's
> stable. Further, swapping the only disk that had changed (the new backup
> volume) with a different one didn't change behavior either.
>
> The last straw was when I was able to reproduce the stalls WITHIN the
> original pool against the same four disks that had been running flawlessly
> for two years under UFS, and still couldn't find any evidence of a hardware
> problem (not even ECC-corrected data returns). All the disks involved are
> completely clean -- zero sector reassignments, the drive-specific log is
> clean, etc.
>
> Attempting to cut back the ARECA adapter's aggressiveness (buffering, etc.)
> on the theory that I was tickling something in its cache-management
> algorithm that was pissing it off proved fruitless as well, even when I
> shut off ALL caching and NCQ options. I also set vfs.zfs.prefetch_disable=1
> to no effect. Hmmmm...
>
> Last night, after reading the ZFS Tuning wiki for FreeBSD, I went on a lark
> and limited the ARC cache to 2GB (vfs.zfs.arc_max=2000000000), set
> vfs.zfs.write_limit_override to 1024000000 (1GB) and rebooted.
>
> The problem instantly disappeared and I cannot provoke its return, even
> with multiple full-bore snapshot and rsync filesystem copies running while
> a scrub is being done.
>
> I'm pinging between being I/O-limited and processor (geli) limited now in
> normal operation, and slamming the I/O channel during a scrub. It appears
> that performance is roughly equivalent to, maybe a bit less than, what it
> was with UFS+SU -- but it's fairly close.
>
> The operating theory I have at the moment is that the ARC cache was in some
> way getting into a near-deadlock situation with other memory demands on the
> system (there IS a Postgres server running on this hardware, although it's
> a replication server and not taking queries -- nonetheless it does grab a
> chunk of RAM), leading to the stalls. Limiting its grab of RAM appears to
> have resolved the contention issue. I was unable to catch it actually
> running out of free memory, although it was consistently into the low
> five-digit free page count, and the kernel never garfed on the console
> about resource exhaustion -- other than a bitch about swap stalling (the
> infamous "more than 20 seconds" message). Page space in use near the time
> in question (I could not get a display while locked, as it went to I/O and
> froze) was not zero, but pretty close to it (a few thousand blocks). That
> the system was driven into light paging does appear to be significant and
> indicative of some sort of memory-contention issue, as under operation with
> UFS filesystems this machine has never been observed to allocate page space.
>
> Anyone seen anything like this before, and if so... is this a case of bad
> defaults, or some bad behavior between various kernel memory allocation
> contention sources?
>
> This isn't exactly a resource-constrained machine, running x64 code with
> 12GB of RAM and two quad-core processors in it!
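For the four-disk GPT + geli + pool layout Karl describes above, a rough
sketch of the commands involved (device names, labels, sector size and the
vdev layout are guesses -- the message doesn't say how the four disks are
arranged in the pool):

    # one disk shown; repeat for da1..da3 with labels disk1..disk3
    gpart create -s gpt da0
    gpart add -t freebsd-zfs -l disk0 da0

    # encrypt the labeled provider and attach it
    # (passphrase/keyfile options omitted)
    geli init -s 4096 /dev/gpt/disk0
    geli attach /dev/gpt/disk0

    # build the pool from the .eli providers; two mirrors striped together
    # is just one possibility, picked to resemble the old 0+1 array
    zpool create tank mirror gpt/disk0.eli gpt/disk1.eli \
                      mirror gpt/disk2.eli gpt/disk3.eli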
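The knobs Karl ended up touching, as they would sit in /boot/loader.conf on
9.1 (values verbatim from his message; the reboot he mentions is what makes
them take effect):

    # disable file-level prefetch (tried earlier, no effect on its own)
    vfs.zfs.prefetch_disable="1"

    # cap the ARC at roughly 2GB
    vfs.zfs.arc_max="2000000000"

    # cap how much dirty data a transaction group may accumulate (~1GB)
    vfs.zfs.write_limit_override="1024000000"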
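Anyone trying to catch the same contention pattern (ARC size vs. free pages
vs. paging) can watch it with stock tools; nothing here is specific to this
setup:

    # current ARC size in bytes vs. the configured ceiling
    sysctl kstat.zfs.misc.arcstats.size vfs.zfs.arc_max

    # free page count (the "low five-digit" figure above) and swap in use
    sysctl vm.stats.vm.v_free_count
    swapinfo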