From: Travis Mikalson <bofh@terranova.net>
Organization: TerraNovaNet Internet Services
To: freebsd-fs@freebsd.org
Date: Sat, 06 Jul 2013 14:30:37 -0400
Subject: Re: Report: ZFS deadlock in 9-STABLE
Message-ID: <51D8624D.30100@terranova.net>
In-Reply-To: <20130704145548.GA91766@icarus.home.lan>

Sorry for the late reply, I've been enjoying a bit of a holiday away
from my PC the last couple of days.

Jeremy Chadwick wrote:
> I'd like to get output from all of these commands:
>
> - dmesg (you can hide/XXX out the system name if you want, but please
>   don't remove anything else, barring IP addresses/etc.)

That is now available in the file storage1-dmesg under
http://tog.net/freebsd/

> - zpool get all

File storage1-zpoolgetall.

> - zfs get all

File storage1-zfsgetall (big file).

> - "gpart show -p" for every disk on the system

Available in storage1-gpartshow-p. The SAS disks show no partitioning
since they're being used in their entirety by ZFS; the compact flash
and the two SSDs are the only partitioned devices.

> - "vmstat -i" when the system is livelocked (if possible; see below)

I will try to get that for you if it happens again. Keep in mind that
if Xin Li's patch is effective, we may not get another chance. Very
optimistic, I know.

I've added "vmstat -i > /dev/null" to a crontab to keep the vmstat
command cached and ready to run during a storage livelock.

> - The exact brand and model string of mps(4) controllers you're using

These are IBM ServeRAID M1015 controllers flashed with LSI SAS 9211-8i
"IT" firmware. They are flashed to be pure passthrough, without RAID
functionality or a BIOS you can boot from. If this isn't enough info
and it's really important, I'll need to pull the server out and open it
up to get whatever info the IBM controllers might have on their
stickers.

> - The exact firmware version and firmware type (often a 2-letter code)
>   you're using on your mps(4) controllers (dmesg might show some of
>   this but possibly not all)

I've flashed these to LSI firmware type "IT", straight passthrough
without RAID functionality.
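In case the flashing details matter, the crossflash was LSI's usual
sas2flash procedure, roughly as below. This is from memory and the
exact image name depends on the firmware package, so treat it as a
sketch rather than a recipe:

    sas2flash -o -f 2118it.bin    # flash the IT firmware image, no boot BIOS
    sas2flash -listall            # verify what each controller now runs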
From dmesg:

    mps0: Firmware: 14.00.00.00, Driver: 14.00.00.01-fbsd
    mps1: Firmware: 14.00.00.00, Driver: 14.00.00.01-fbsd
    mps2: Firmware: 14.00.00.00, Driver: 14.00.00.01-fbsd

I can confirm that this is accurate; 14.00.00.00 is what I flashed on
there, per the release notes in the LSI firmware package I used:

    SCS Engineering Release Notice
    Phase14 GCA Release
    Version 14.00.00.00 - SAS2FW_Phase14 (SCGCQ00300504)

> - Is powerd(8) running on this system at all?

It is not.

> Please put these in separate files and upload them to
> http://tog.net/freebsd/ if you could. (For the gpart output, you can
> put all the output from all the disks in a single file)

Done, as described above.

> I can see your ZFS disks are probably using those mps(4) controllers.
> I also see you have an AHCI controller.

Right -- three mps controllers plugged into the system, plus the
motherboard's onboard controllers. There's a DVD-ROM connected in ATA
mode:

    atapci0: port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xff00-0xff0f at device 20.1 on pci0
    ata0: at channel 0 on atapci0
    ata1: at channel 1 on atapci0
    cd0 at ata0 bus 0 scbus7 target 1 lun 0

And the compact flash is connected to an onboard controller that's in
AHCI mode:

    ahci0: port 0x9000-0x9007,0x8000-0x8003,0x7000-0x7007,0x6000-0x6003,0x5000-0x500f mem 0xfebfbc00-0xfebfbfff irq 22 at device 17.0 on pci0
    ahci0: AHCI v1.10 with 4 3Gbps ports, Port Multiplier supported
    ahcich0: at channel 0 on ahci0
    ahcich1: at channel 1 on ahci0
    ahcich2: at channel 2 on ahci0
    ahcich3: at channel 3 on ahci0
    ada0 at ahcich2 bus 0 scbus5 target 0 lun 0

> I know you can't move all your disks to the AHCI controller due to
> there not being enough ports, and the controller might not even work
> with SAS disks (depends, some newer/higher end Intel ones do), but:
>
> A "CF drive locking up too" doesn't really tell us anything about the
> CF drive, how it's hooked up, etc... But I'd rather not even go into
> that, because:

The compact flash is connected using a SYBA SD-ADA40001 SATA II to
Compact Flash adapter. This and the DVD-ROM drive are the only devices
connected to the motherboard's onboard controllers.

> Advice:
>
> Hook a SATA disk up to your ahci(4) controller and just leave it
> there. No filesystem, just a raw disk sitting on a bus. When the
> livelock happens, in another window issue "dd if=/dev/ada0
> of=/dev/null bs=64k" (disk might not be named ada0; again, need that
> dmesg) and after a second or two press Ctrl-T to see if you get any
> output (output should be immediate). If you do get output, it means
> GEOM and/or CAM are still functional in some manner, and that puts
> more focus on the mps(4) side of things. There are still nearly
> infinite explanations for what's going on though. Which leads me
> to...

Sure, I could do that, but would it be sufficient to try dd'ing from
the compact flash device that's already connected via AHCI? I can add a
dd command to crontab to run every minute so dd is cached and available
for me to run during a loss-of-storage livelock, along the lines of the
sketch below.
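Concretely, the crontab entries would look something like this (stock
FreeBSD paths assumed; the first two are what I already have in place,
the dd line is the one I'd be adding):

    # keep diagnostic binaries resident in cache so they can still be
    # exec'd once the pool stops answering I/O
    * * * * *  root  /usr/bin/vmstat -i > /dev/null 2>&1
    * * * * *  root  /usr/bin/procstat -kk -a > /dev/null 2>&1
    * * * * *  root  /bin/dd if=/dev/null of=/dev/null > /dev/null 2>&1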
> Question:
>
> If the system is livelocked, how are you running "procstat -kk -a" in
> the first place? Or does it "livelock" and then release itself from
> the pain (eventually), only later to re-lock? A "livelock" usually
> implies the system is alive in some way (hitting NumLock on the
> keyboard (hopefully PS/2) still toggles the LED (kernel does this --
> I've used this as a way to see if a system is locked up or not for
> years)), just that some layer pertaining to your focus (ZFS I/O) is
> wonky. If it comes and goes, there may be some explanations for that,
> but output from those commands would greatly help.

This is a permanent livelock; it never recovers on its own. The system
requires a hard reset.

I have a cron job running every minute that runs "procstat -kk -a >
/dev/null" to ensure the procstat command is always cached and
available when I need to use it without any access to storage. During
the livelock I used the ssh session I already had open to run my
procstat -kk -a; it was the last thing I could do within that session
without resetting the system. After the procstat command completed I
apparently needed an I/O to get my shell prompt back, and that of
course never came.

It's a livelock of the storage-related bits only. NumLock does toggle,
and you actually get a response in all the normal ways you'd expect if
your FreeBSD system had suddenly lost all contact with its storage: you
can ping it, you can establish TCP connections with any listening
services (but get no banner/greeting response), and so on. The console
is responsive right up to the point that it needs an I/O.

> Question:
>
> What's with the tunings in loader.conf and sysctl.conf for ZFS? Not
> saying those are the issue, just asking why you're setting those at
> all. Is there something we need to know about that you've run into in
> the past?

It's all from fairly well-documented situations, including things from
https://wiki.freebsd.org/ZFSTuningGuide

The sysctl tunings are as follows:

Knowing my HPET is trustworthy, I chose it as my best timecounter
hardware. I have used the default and other timecounters with no effect
on my livelock issue.

I upped maxvnodes, merely as performance tuning considering this box's
role. This is actually mentioned in the FreeBSD Handbook.

    vfs.zfs.l2arc_write_max=250000000
    vfs.zfs.l2arc_write_boost=450000000

These increase the speed at which ZFS will write to cache devices. By
default ZFS seems to throttle writes to a cache device quite a lot, and
these SSDs can handle far more than ZFS was giving them.

The loader.conf tunings are as follows:

I increased the amount of filesystem metadata ZFS allows to be cached
in ARC, since I've got many small things I'd like to be able to find in
both ARC and ~460GB of L2ARC with less need to walk around on the disks
looking for metadata. All things considered, I don't mind my RAM ARC
being mostly full of metadata with the actual cached data coming off
the SSDs.

I also increased the ZFS TXG write limit to something appropriately
large for the RAM I've got to work with.

The last and most recent tunings in loader.conf: I actually ran out of
mbuf clusters a month ago and had to hard reset the system to bring the
network back; "ifconfig down/up" gave an error along the lines of
"cannot allocate memory". I increased all the related limits to fairly
high values, as the defaults were apparently too low -- and this was
just a single gigabit interface at 70-90% utilization at the time of
total, permanent networking death due to mbuf cluster exhaustion.
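For reference, the limits in question are the mbuf cluster knobs. A
sketch of the sort of thing that went into /boot/loader.conf -- the
value below is illustrative rather than my exact setting, and should be
sized to RAM and workload:

    # raise the mbuf cluster ceiling; the default proved too low under
    # sustained gigabit traffic (value illustrative)
    kern.ipc.nmbclusters="262144"

"netstat -m" shows current cluster usage and any denied allocation
requests, which is worth watching before things get to the hard-reset
stage.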