Date: Thu, 21 Mar 2013 01:53:05 -0700
From: Jeremy Chadwick <jdc@koitsu.org>
To: Quartz
Cc: freebsd-fs@freebsd.org
Subject: Re: ZFS question
Message-ID: <20130321085304.GB16997@icarus.home.lan>
In-Reply-To: <514AA192.2090006@sneakertech.com>

On Thu, Mar 21, 2013 at 01:58:42AM -0400, Quartz wrote:
>
> >1. freebsd-fs is the proper list for filesystem-oriented questions of
> >this sort, especially for ZFS.
>
> Ok, I'm assuming I should subscribe to that list and post there then?

Correct.  Cross-posting this thread to freebsd-fs (e.g. adding it to the
CC line) is generally shunned.  I've changed the CC line to use
freebsd-fs@ instead, and will follow up with freebsd-questions@ stating
that the thread/discussion has been moved.

I've also snipped the rest of our conversation, because once I got to the
very, VERY end of the convo and recapped what all has been said in this
thread (how you reported the problem vs. what the problem is), I realise
none of this really matters.

I also don't want to get into a discussion about -RELEASE vs. -STABLE,
because I could practically write a book on the subject (particularly why
-STABLE is a better choice).

One thing I did want to discuss:

> There are eight drives in the machine at the moment, and I'm not
> messing with partitions yet because I don't want to complicate things.
> (I will eventually be going that route though as the controller tends
> to renumber drives in a first-come-first-serve order that makes some
> things difficult).

Solving this is easy, WITHOUT use of partitions or labels.

There is a feature of CAM(4) called "wired down" or "wiring down", where
you can in essence statically map a SATA port to a static device number
regardless of whether a disk is inserted at the time the kernel boots
(i.e. SATA port 0 on controller X is always ada2, SATA port 1 on
controller X is always ada3, SATA port 0 on controller Y is always ada0,
etc.).

I've discussed how to do this many times over the years, including
recently as well.  It involves some lines in /boot/loader.conf.  It can
sometimes be tricky to figure out depending on the type of controllers
you're using, but you do the work/set this up *once* and never touch it
again (barring changing brands of controllers).  Trust me, it's really
not that bad.

I can help you with this, but I need to see a dmesg (everything from boot
to the point mountroot gets done).
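To give you a rough idea of what this looks like, here's a minimal sketch
for a hypothetical box with a single AHCI controller whose channels probe
as ahcich0 through ahcich3.  The scbus/ada unit numbers below are made up
for illustration; the correct values for your machine have to be worked
out from your dmesg:

# /boot/loader.conf -- hypothetical example only
# Wire each AHCI channel to a fixed scbus, then wire each ada unit to
# that scbus, so ada0..ada3 stop depending on probe order.
hint.scbus.0.at="ahcich0"
hint.scbus.1.at="ahcich1"
hint.scbus.2.at="ahcich2"
hint.scbus.3.at="ahcich3"
hint.ada.0.at="scbus0"
hint.ada.1.at="scbus1"
hint.ada.2.at="scbus2"
hint.ada.3.at="scbus3"

With hints like these in place, each ada unit number is tied to a
physical port, so pulling a disk no longer causes the remaining devices
to shuffle.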
> >All that's assuming that the issue truly is ZFS waiting for I/O and not
> >something else
>
> Well, everything I've read so far indicates that zfs has issues when
> dealing with un-writable pools, so I assume that's what's going on
> here.

Let's recap what was said; I'm sorry for hemming and hawing over what was
said, but the way you phrased your issue/situation matters.

This is how you described your problem initially:

> I'm experiencing fatal issues with pools hanging my machine requiring a
> hard-reset.

This, to me, means something very different than what was described in a
subsequent follow-up:

> However, when I pop a third drive, the machine becomes VERY unstable. I
> can nose around the boot drive just fine, but anything involving i/o
> that so much as sneezes in the general direction of the pool hangs the
> machine. Once this happens I can log in via ssh, but that's pretty much
> it.
>
> The machine never recovers (at least, not inside 35 minutes, which is
> the most I'm willing to wait). Reconnecting the drives has no effect. My
> only option is to hard reset the machine with the front panel button.
> Googling for info suggested I try changing the pool's "failmode" setting
> from "wait" to "continue", but that doesn't appear to make any
> difference. For reference, this is a virgin 9.1-release installed off
> the dvd image with no ports or packages or any extra anything.

So let's recap, along with some answers:

S1. In your situation, when a ZFS pool loses enough vdevs or vdev members
to cause permanent pool damage (as in completely 100% unrecoverable, such
as losing 3 disks of a raidz2 pool), any I/O to the pool results in the
issuing applications hanging.  The system is still functional/usable
(e.g. I/O to other pools and non-ZFS filesystems works fine); it's just
that I/O to the now-busted pool hangs indefinitely.

A1. This is because the pool has "failmode=wait" set, which is the
default property value.  This is by design; there is no ZFS "timeout" for
this sort of thing.  "failmode=continue" is what you're looking for (keep
reading).

S2. If the pool uses "failmode=continue", there is no change in behaviour
(i.e. EIO is still never returned).

A2. That sounds like a bug then.  I test your claim below, and you might
be surprised at the findings.

S3. If the previously-yanked disks are reinserted, the issue remains.

A3. What you're looking for is the "autoreplace" pool property.  However,
on FreeBSD, this property is in effect a no-op; manual intervention is
always required to replace a disk ("zpool replace").
Solaris/Illumos/etc. don't have this problem because they have proper
notification frameworks (fmd/FMA and SMF) that can make this happen.  On
FreeBSD, you could accomplish running "zpool replace" automatically with
devd(8), but that's up to you.
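If you did want to go the devd(8) route, a minimal sketch might look like
the following.  The pool name ("array"), the da[0-9]+ pattern, and the
choice of "zpool online" are all placeholders for illustration, not
something to drop in verbatim:

# Hypothetical /usr/local/etc/devd/zfs-reattach.conf -- illustration only.
# When a new daX character device node appears, try to bring it back
# online in the pool named "array".
notify 100 {
	match "system"		"DEVFS";
	match "subsystem"	"CDEV";
	match "type"		"CREATE";
	match "cdev"		"da[0-9]+";
	action "/sbin/zpool online array /dev/$cdev";
};

A real rule would want more sanity checking (e.g. only acting on disks
you know belong to the pool), and for a disk that is genuinely new rather
than re-inserted you'd want "zpool replace" instead of "zpool online".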
Now let's talk about the "failmode=continue" bug/issue.

Here's a testbox I use for testing issues with CAM, ZFS, and other bits:

root@testbox:/root # uname -a
FreeBSD testbox.home.lan 9.1-RELEASE FreeBSD 9.1-RELEASE #0 r243825: Tue Dec 4 09:23:10 UTC 2012     root@farrell.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  amd64

root@testbox:/root # zpool create array raidz2 da1 da2 da3 da4
root@testbox:/root # zpool status
  pool: array
 state: ONLINE
  scan: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	array       ONLINE       0     0     0
	  raidz2-0  ONLINE       0     0     0
	    da1     ONLINE       0     0     0
	    da2     ONLINE       0     0     0
	    da3     ONLINE       0     0     0
	    da4     ONLINE       0     0     0

errors: No known data errors

root@testbox:/root # zpool set failmode=continue array

Now in another window, launching dd to do some gradual but continuous
I/O, and using Ctrl-T (SIGINFO) to get statuses:

root@testbox:/root # dd if=/dev/zero of=/array/testfile bs=1
load: 0.00  cmd: dd 939 [running] 0.62r 0.00u 0.62s 5% 1508k
83348+0 records in
83347+0 records out
83347 bytes transferred in 0.620288 secs (134368 bytes/sec)

Now I physically remove da4...

root@testbox:/root # zpool status
  pool: array
 state: DEGRADED
status: One or more devices has been removed by the administrator.
	Sufficient replicas exist for the pool to continue functioning in
	a degraded state.
action: Online the device using 'zpool online' or replace the device with
	'zpool replace'.
  scan: none requested
config:

	NAME                     STATE     READ WRITE CKSUM
	array                    DEGRADED     0     0     0
	  raidz2-0               DEGRADED     0     0     0
	    da1                  ONLINE       0     0     0
	    da2                  ONLINE       0     0     0
	    da3                  ONLINE       0     0     0
	    9863791736611294808  REMOVED      0     0     0  was /dev/da4

errors: No known data errors

dd is still transferring data:

load: 0.53  cmd: dd 939 [running] 39.58r 0.55u 38.94s 100% 1512k
5792063+0 records in
5792062+0 records out
5792062 bytes transferred in 39.580059 secs (146338 bytes/sec)

Now I physically remove da3...

root@testbox:/root # zpool status
  pool: array
 state: DEGRADED
status: One or more devices has been removed by the administrator.
	Sufficient replicas exist for the pool to continue functioning in
	a degraded state.
action: Online the device using 'zpool online' or replace the device with
	'zpool replace'.
  scan: none requested
config:

	NAME                      STATE     READ WRITE CKSUM
	array                     DEGRADED     0     0     0
	  raidz2-0                DEGRADED     0     0     0
	    da1                   ONLINE       0     0     0
	    da2                   ONLINE       0     0     0
	    16564477967045696210  REMOVED      0     0     0  was /dev/da3
	    9863791736611294808   REMOVED      0     0     0  was /dev/da4

errors: No known data errors

dd is still going:

load: 0.81  cmd: dd 939 [running] 83.55r 1.28u 81.63s 100% 1512k
12537268+0 records in
12537267+0 records out
12537267 bytes transferred in 83.552147 secs (150053 bytes/sec)

Now I physically remove da2...

root@testbox:/root # zpool status
  pool: array
 state: DEGRADED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run
	'zpool clear'.
   see: http://illumos.org/msg/ZFS-8000-JQ
  scan: none requested
config:

	NAME                      STATE     READ WRITE CKSUM
	array                     DEGRADED     0    16     0
	  raidz2-0                DEGRADED     0    40     0
	    da1                   ONLINE       0     0     0
	    da2                   ONLINE       0    46     0
	    16564477967045696210  REMOVED      0     0     0  was /dev/da3
	    9863791736611294808   REMOVED      0     0     0  was /dev/da4

errors: 2 data errors, use '-v' for a list

And in the other window where dd is running, it immediately terminates
with EIO:

dd: /array/testfile: Input/output error
22475027+0 records in
22475026+0 records out
22475026 bytes transferred in 150.249338 secs (149585 bytes/sec)
root@testbox:/root #

So at this point, I can safely say that ***actively running*** processes
which are doing I/O to the pool DO get passed on EIO status.  But just
wait, the situation gets more interesting...

One thing to note (and it's important) above is that da2 is still
considered "ONLINE".  More on that in a moment.

I then decide to issue some other I/O requests to /array (such as copying
/array/testfile to /tmp), to see what the behaviour is in this state:

root@testbox:/root # ls -l /array
total 21984
-rw-r--r--  1 root  wheel  22475026 Mar 21 01:11 testfile

How this ls worked is beyond me, since the pool is effectively broken.
Possibly some of this is being pulled from the ARC or vnode caching, I
don't know.

Anyway, I decide to copy /array/testfile to /tmp to see what happens:

root@testbox:/root # cp /array/testfile /tmp
load: 0.00  cmd: cp 959 [tx->tx_sync_done_cv)] 4.88r 0.00u 0.10s 0% 2520k
load: 0.00  cmd: cp 959 [tx->tx_sync_done_cv)] 7.02r 0.00u 0.10s 0% 2520k
^C^C^C^C^Z

Clearly you can see here that a syscall of sorts is stuck indefinitely
waiting on the kernel.  Kernel call stack for cp:

root@testbox:/root # procstat -kk 959
  PID    TID COMM             TDNAME           KSTACK
  959 100090 cp               -                mi_switch+0x186 sleepq_wait+0x42 _cv_wait+0x121 txg_wait_synced+0x85 dmu_tx_assign+0x170 zfs_inactive+0xf1 zfs_freebsd_inactive+0x1a vinactive+0x8d vputx+0x2d8 vn_close+0xa4 vn_closefile+0x5d _fdrop+0x23 closef+0x52 kern_close+0x172 amd64_syscall+0x546 Xfast_syscall+0xf7

So while this is going on, I decide to reattach da2 with the plan of
issuing "zpool replace array da2" -- sure, even though the pool is
completely horked (data loss) at this point, I figure what the hell.

Upon inserting da2, CAM and its related bits say nothing about device
insertion.  When da2 was removed, indeed there were messages.  Hmm, this
sounds reminiscent of something I've seen recently (keep reading):

root@testbox:/root # camcontrol devlist
                     at scbus1 target 0 lun 0 (pass0,cd0)
                     at scbus2 target 0 lun 0 (pass1,da0)
                     at scbus2 target 1 lun 0 (pass2,da1)
                     at scbus2 target 2 lun 0 (pass3,da2)
root@testbox:/root # ls -l /dev/da*
crw-r-----  1 root  operator    0,  88 Mar 21 00:52 /dev/da0
crw-r-----  1 root  operator    0,  94 Mar 21 00:52 /dev/da0p1
crw-r-----  1 root  operator    0,  95 Mar 21 00:52 /dev/da0p2
crw-r-----  1 root  operator    0,  96 Mar 21 00:52 /dev/da0p3
crw-r-----  1 root  operator    0,  89 Mar 21 00:52 /dev/da1

Notice no /dev/da2.  So this shouldn't come as much of a surprise:

root@testbox:/root # zpool replace array da2
cannot open 'da2': no such GEOM provider
must be a full path or shorthand device name

This would indicate a separate/different bug, probably in CAM or its
related pieces.  There were fixes for very similar situations to this in
stable/9 recently -- I know because I was the person who reported such.
mav@ and ken@ worked out a series of kinks/bugs in CAM pertaining to
pass(4) and xpt(4) and some other things.  You can read about that here:

http://lists.freebsd.org/pipermail/freebsd-fs/2013-February/016515.html
http://lists.freebsd.org/pipermail/freebsd-fs/2013-February/016524.html
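(As an aside: the usual first stab when a re-inserted disk doesn't show
up as a device node is to ask CAM to rescan the bus manually, e.g.:

root@testbox:/root # camcontrol rescan all

Whether that helps in this particular state is untested here; if it
doesn't bring /dev/da2 back either, that's consistent with the
pass(4)/xpt(4) problems those fixes address.)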
For me to determine if those fixes address the above oddity while
testing, I would need to build stable/9 on this testbox.  I can do that,
and will try to dedicate some time to it tomorrow.

So in summary: there seem to be multiple issues shown above, but I can
confirm that failmode=continue **does** pass EIO to *running* processes
that are doing I/O.  Subsequent I/O, however, is questionable at this
time.

I'll end this Email with (hopefully) an educational statement:

I hope my analysis shows you why very thorough, detailed output/etc.
needs to be provided when reporting a problem, and not just some
"general" description.  This is why hard data/logs/etc. are necessary,
and why every single step of the way needs to be provided, including
physical tasks performed.

P.S. -- I started this Email at 23:15 PDT.  It's now 01:52 PDT.  To whom
should I send a bill for time rendered?  ;-)

--
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |