Date: Thu, 4 Apr 2013 11:05:17 -0500
From: Kevin Day <toasty@dragondata.com>
To: Andriy Gapon <avg@FreeBSD.org>
Cc: freebsd-fs@FreeBSD.org
Subject: Re: kern/177536: zfs livelock (deadlock) with high write-to-disk load
Message-ID: <D75AD2BC-E02A-45D2-BDDE-57B99ED77AA0@dragondata.com>
In-Reply-To: <201304041540.r34Fe1Ka057203@freefall.freebsd.org>
References: <201304041540.r34Fe1Ka057203@freefall.freebsd.org>

I'm not sure if I'm experiencing the same thing, but I'm chiming in in case this helps someone.

We have a server that's configured as such:

9.1-RELEASE amd64, dual Opteron CPU, 64GB of memory.

nVidia nForce MCP55 SATA controller
* 1x 240GB SSD used for ZFS l2arc

mps LSI 9207-8e (LSISAS2308 chip)
* Connected to 4 external enclosures, each with 24 3TB drives for a total of 96 3TB drives, running ZFS in a JBOD configuration

twa 3ware 9650SE-12i
* Connected 1:1 (no expander) to 12 internal 500GB drives, running UFS for / and a secondary UFS filesystem

When there's very heavy write load to the giant ZFS filesystem (>2gbps of total incoming data being written), eventually I reach some kind of deadlock, where I can't do anything that touches any of the block devices.

Processes that attempt to access any filesystem (ZFS or UFS) will get stuck in 'ufs', 'getblk', 'vnread', or 'tx->tx'. A shell is still responsive, and I can run commands as long as they're cached. Trying to run something that wasn't already cached prior to the problem will hang that shell.

'gstat' shows that most (all?) of the disk devices have outstanding requests waiting, but a busy percentage of 0% and no activity happening.

This only seems to happen under heavy ZFS writes. Heavy ZFS reads or heavy UFS writes do not trigger this. Slowing down the ZFS writes will prevent the problem from occurring.

At first I thought this was a controller hang, but seeing devices on three different controllers all end up stuck with outstanding requests has me a bit confused as to how this could even happen.

Nothing gets logged to the console when this happens.

Things I've tried already:

1) Remove the SSD entirely
2) zfs set sync=disabled fs
3) Letting the system wait (90 minutes) to see if this recovers.
4) Swapped the motherboard/CPUs/memory for an identically configured system
5) Switched from an LSI 9280 (mpt) to an LSI 9207 (mps)
6) Updated firmware on the storage cards, updated the BIOS on the motherboard

Fair disclosure, these Opterons do have the TLB bug (AMD errata 298), but the BIOS has a workaround for it which is enabled. We've got dozens of systems identical to this one and aren't experiencing any weird hangs or anything elsewhere, so I'm assuming this is not it.

The problem is that this is a production system that doesn't give me a lot of time for troubleshooting before I'm forced to reboot it. I'm going to try to get procstat to stay in the cache so that next time this happens I can try running it. If there's anything else anyone would like me to capture when this happens again, I'm happy to try.

-- Kevin