From: Travis Mikalson <bofh@terranova.net>
Organization: TerraNovaNet Internet Services
To: freebsd-fs@freebsd.org
Date: Sat, 06 Jul 2013 14:30:37 -0400
Subject: Re: Report: ZFS deadlock in 9-STABLE
Message-ID: <51D8624D.30100@terranova.net>
In-Reply-To: <20130704145548.GA91766@icarus.home.lan>

Sorry for the late reply, I've been enjoying a bit of a holiday away
from my PC the last couple of days.

Jeremy Chadwick wrote:
> I'd like to get output from all of these commands:
>
> - dmesg (you can hide/XXX out the system name if you want, but please
>   don't remove anything else, barring IP addresses/etc.)

That is now available in the file storage1-dmesg under
http://tog.net/freebsd/

> - zpool get all

File storage1-zpoolgetall.

> - zfs get all

File storage1-zfsgetall (big file).

> - "gpart show -p" for every disk on the system

Available in storage1-gpartshow-p. The SAS disks show no partitioning
since they're being used in their entirety by ZFS; the compact flash
and the two SSDs are the only partitioned devices.

> - "vmstat -i" when the system is livelocked (if possible; see below)

I will try to get that for you if it happens again. Keep in mind that
if Xin Li's patch is effective, we may not get another chance. Very
optimistic, I know.

I've added "vmstat -i > /dev/null" to a crontab to keep the vmstat
command cached and ready to run during a storage livelock.

> - The exact brand and model string of mps(4) controllers you're using

These are IBM ServeRAID M1015 controllers flashed with LSI SAS 9211-8i
"IT" firmware. They are flashed to be pure passthrough, without RAID
functionality or a BIOS you can boot from. If this isn't enough info
and it's really important, I'll need to pull the server out and open it
up to get whatever info the IBM controllers might have on their
stickers.

> - The exact firmware version and firmware type (often a 2-letter code)
>   you're using on your mps(4) controllers (dmesg might show some of
>   this but possibly not all)

I've flashed these to LSI firmware type "IT", straight passthrough
without RAID functionality.
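In case the flashing details matter, the crossflash was LSI's usual
sas2flash procedure, roughly as below. This is from memory and the
exact image name depends on the firmware package, so treat it as a
sketch rather than a recipe:

    sas2flash -o -f 2118it.bin    # flash the IT firmware image, no boot BIOS
    sas2flash -listall            # verify what each controller now runs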
From dmesg:

    mps0: Firmware: 14.00.00.00, Driver: 14.00.00.01-fbsd
    mps1: Firmware: 14.00.00.00, Driver: 14.00.00.01-fbsd
    mps2: Firmware: 14.00.00.00, Driver: 14.00.00.01-fbsd

I can confirm that this is accurate; 14.00.00.00 is what I flashed on
there, per the release notes in the LSI firmware package I used:

    SCS Engineering Release Notice
    Phase14 GCA Release
    Version 14.00.00.00 - SAS2FW_Phase14 (SCGCQ00300504)

> - Is powerd(8) running on this system at all?

It is not.

> Please put these in separate files and upload them to
> http://tog.net/freebsd/ if you could. (For the gpart output, you can
> put all the output from all the disks in a single file)

Done, as described above.

> I can see your ZFS disks are probably using those mps(4) controllers.
> I also see you have an AHCI controller.

Right -- three mps controllers plugged into the system, plus the
motherboard's onboard controllers. There's a DVD-ROM connected in ATA
mode:

    atapci0: port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xff00-0xff0f at device 20.1 on pci0
    ata0: at channel 0 on atapci0
    ata1: at channel 1 on atapci0
    cd0 at ata0 bus 0 scbus7 target 1 lun 0

And the compact flash is connected to an onboard controller that's in
AHCI mode:

    ahci0: port 0x9000-0x9007,0x8000-0x8003,0x7000-0x7007,0x6000-0x6003,0x5000-0x500f mem 0xfebfbc00-0xfebfbfff irq 22 at device 17.0 on pci0
    ahci0: AHCI v1.10 with 4 3Gbps ports, Port Multiplier supported
    ahcich0: at channel 0 on ahci0
    ahcich1: at channel 1 on ahci0
    ahcich2: at channel 2 on ahci0
    ahcich3: at channel 3 on ahci0
    ada0 at ahcich2 bus 0 scbus5 target 0 lun 0

> I know you can't move all your disks to the AHCI controller due to
> there not being enough ports, and the controller might not even work
> with SAS disks (depends, some newer/higher end Intel ones do), but:
>
> A "CF drive locking up too" doesn't really tell us anything about the
> CF drive, how it's hooked up, etc... But I'd rather not even go into
> that, because:

The compact flash is connected using a SYBA SD-ADA40001 SATA II to
Compact Flash adapter. This and the DVD-ROM drive are the only devices
connected to the motherboard's onboard controllers.

> Advice:
>
> Hook a SATA disk up to your ahci(4) controller and just leave it
> there. No filesystem, just a raw disk sitting on a bus. When the
> livelock happens, in another window issue "dd if=/dev/ada0
> of=/dev/null bs=64k" (disk might not be named ada0; again, need that
> dmesg) and after a second or two press Ctrl-T to see if you get any
> output (output should be immediate). If you do get output, it means
> GEOM and/or CAM are still functional in some manner, and that puts
> more focus on the mps(4) side of things. There are still nearly
> infinite explanations for what's going on though. Which leads me
> to...

Sure, I could do that, but would it be sufficient to try dd'ing from
the compact flash device that's already connected via AHCI? I can add a
dd command to crontab to run every minute so dd is cached and available
for me to run during a loss-of-storage livelock, along the lines of the
sketch below.
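Concretely, the crontab entries would look something like this (stock
FreeBSD paths assumed; the first two are what I already have in place,
the dd line is the one I'd be adding):

    # keep diagnostic binaries resident in cache so they can still be
    # exec'd once the pool stops answering I/O
    * * * * *  root  /usr/bin/vmstat -i > /dev/null 2>&1
    * * * * *  root  /usr/bin/procstat -kk -a > /dev/null 2>&1
    * * * * *  root  /bin/dd if=/dev/null of=/dev/null > /dev/null 2>&1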
> Question:
>
> If the system is livelocked, how are you running "procstat -kk -a" in
> the first place? Or does it "livelock" and then release itself from
> the pain (eventually), only later to re-lock? A "livelock" usually
> implies the system is alive in some way (hitting NumLock on the
> keyboard (hopefully PS/2) still toggles the LED (kernel does this --
> I've used this as a way to see if a system is locked up or not for
> years)), just that some layer pertaining to your focus (ZFS I/O) is
> wonky. If it comes and goes, there may be some explanations for that,
> but output from those commands would greatly help.

This is a permanent livelock; it never recovers on its own. The system
requires a hard reset.

I have a cron job running every minute that runs "procstat -kk -a >
/dev/null" to ensure the procstat command is always cached and
available when I need to use it without any access to storage. During
the livelock I used the ssh session I already had open to run my
procstat -kk -a; it was the last thing I could do within that session
without resetting the system. After the procstat command completed I
apparently needed an I/O to get my shell prompt back, and that of
course never came.

It's a livelock of the storage-related bits only. NumLock does toggle,
and you actually get a response in all the normal ways you'd expect if
your FreeBSD system had suddenly lost all contact with its storage: you
can ping it, you can establish TCP connections with any listening
services (but get no banner/greeting response), and so on. The console
is responsive right up to the point that it needs an I/O.

> Question:
>
> What's with the tunings in loader.conf and sysctl.conf for ZFS? Not
> saying those are the issue, just asking why you're setting those at
> all. Is there something we need to know about that you've run into in
> the past?

It's all from fairly well-documented situations, including things from
https://wiki.freebsd.org/ZFSTuningGuide

The sysctl tunings are as follows:

Knowing my HPET is trustworthy, I chose it as my best timecounter
hardware. I have used the default and other timecounters with no effect
on my livelock issue.

I upped maxvnodes, merely as performance tuning considering this box's
role. This is actually mentioned in the FreeBSD Handbook.

    vfs.zfs.l2arc_write_max=250000000
    vfs.zfs.l2arc_write_boost=450000000

These increase the speed at which ZFS will write to cache devices. By
default ZFS seems to throttle writes to a cache device quite a lot, and
these SSDs can handle far more than ZFS was giving them.

The loader.conf tunings are as follows:

I increased the amount of filesystem metadata ZFS allows to be cached
in ARC, since I've got many small things I'd like to be able to find in
both ARC and ~460GB of L2ARC with less need to walk around on the disks
looking for metadata. All things considered, I don't mind my RAM ARC
being mostly full of metadata with the actual cached data coming off
the SSDs.

I also increased the ZFS TXG write limit to something appropriately
large for the RAM I've got to work with.

The last and most recent tunings in loader.conf: I actually ran out of
mbuf clusters a month ago and had to hard reset the system to bring the
network back; "ifconfig down/up" gave an error along the lines of
"cannot allocate memory". I increased all the related limits to fairly
high values, as the defaults were apparently too low -- and this was
just a single gigabit interface at 70-90% utilization at the time of
total, permanent networking death due to mbuf cluster exhaustion.
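For reference, the limits in question are the mbuf cluster knobs. A
sketch of the sort of thing that went into /boot/loader.conf -- the
value below is illustrative rather than my exact setting, and should be
sized to RAM and workload:

    # raise the mbuf cluster ceiling; the default proved too low under
    # sustained gigabit traffic (value illustrative)
    kern.ipc.nmbclusters="262144"

"netstat -m" shows current cluster usage and any denied allocation
requests, which is worth watching before things get to the hard-reset
stage.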