Date: Thu, 21 Mar 2013 01:53:05 -0700
From: Jeremy Chadwick <jdc@koitsu.org>
To: Quartz <quartz@sneakertech.com>
Cc: freebsd-fs@freebsd.org
Subject: Re: ZFS question
Message-ID: <20130321085304.GB16997@icarus.home.lan>
In-Reply-To: <514AA192.2090006@sneakertech.com>
References: <20130321044557.GA15977@icarus.home.lan> <514AA192.2090006@sneakertech.com>
On Thu, Mar 21, 2013 at 01:58:42AM -0400, Quartz wrote:
> > 1. freebsd-fs is the proper list for filesystem-oriented questions of
> > this sort, especially for ZFS.
>
> Ok, I'm assuming I should subscribe to that list and post there then?

Correct.  Cross-posting this thread to freebsd-fs (e.g. adding it to the
CC line) is generally shunned.  I've changed the CC line to use
freebsd-fs@ instead, and will follow up with freebsd-questions@ stating
that the thread/discussion has been moved.

I've also snipped the rest of our conversation, because once I got to the
very, VERY end of the convo and recapped what has been said in this
thread (how you reported the problem vs. what the problem actually is),
I realise none of it really matters.  I also don't want to get into a
discussion about -RELEASE vs. -STABLE, because I could practically write
a book on the subject (particularly on why -STABLE is the better choice).

One thing I did want to discuss:

> There are eight drives in the machine at the moment, and I'm not
> messing with partitions yet because I don't want to complicate things.
> (I will eventually be going that route though as the controller tends
> to renumber drives in a first-come-first-serve order that makes some
> things difficult).

Solving this is easy, WITHOUT use of partitions or labels.  There is a
feature of CAM(4) called "wired down" or "wiring down", where you can in
essence statically map a SATA port to a fixed device number regardless
of whether a disk is inserted at the time the kernel boots (i.e. SATA
port 0 on controller X is always ada2, SATA port 1 on controller X is
always ada3, SATA port 0 on controller Y is always ada0, etc.).

I've discussed how to do this many times over the years, including
recently.  It involves some lines in /boot/loader.conf.  It can
sometimes be tricky to figure out depending on the type of controllers
you're using, but you do the work/set this up *once* and never touch it
again (barring changing brands of controllers).  Trust me, it's really
not that bad.  I can help you with this, but I need to see a dmesg
(everything from boot to the point mountroot gets done).
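Just so the idea is concrete, here's a minimal sketch of the sort of
/boot/loader.conf lines I'm talking about.  The driver name (ahcich) and
the bus/unit numbers here are made up for illustration -- the real values
have to come from your dmesg, which is why I need to see it:

# Hypothetical example only -- the channel/bus/unit numbers must match
# what your controllers actually probe as in dmesg.
hint.scbus.0.at="ahcich0"    # CAM bus 0 is wired to AHCI channel 0
hint.scbus.1.at="ahcich1"    # CAM bus 1 is wired to AHCI channel 1
hint.ada.0.at="scbus0"       # the disk on that bus is always ada0
hint.ada.1.at="scbus1"       # the disk on that bus is always ada1

With entries like these for every port, the adaX numbering stays stable
no matter which disks happen to be present at boot.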
> > All that's assuming that the issue truly is ZFS waiting for I/O and
> > not something else
>
> Well, everything I've read so far indicates that zfs has issues when
> dealing with un-writable pools, so I assume that's what's going on
> here.

Let's recap what was said; I'm sorry for hemming and hawing over it, but
the way you phrased your issue/situation matters.  This is how you
described your problem initially:

> I'm experiencing fatal issues with pools hanging my machine requiring a
> hard-reset.

This, to me, means something very different from what was described in a
subsequent follow-up:

> However, when I pop a third drive, the machine becomes VERY unstable. I
> can nose around the boot drive just fine, but anything involving i/o
> that so much as sneezes in the general direction of the pool hangs the
> machine. Once this happens I can log in via ssh, but that's pretty much
> it.
>
> The machine never recovers (at least, not inside 35 minutes, which is
> the most I'm willing to wait). Reconnecting the drives has no effect. My
> only option is to hard reset the machine with the front panel button.
> Googling for info suggested I try changing the pool's "failmode" setting
> from "wait" to "continue", but that doesn't appear to make any
> difference. For reference, this is a virgin 9.1-release installed off
> the dvd image with no ports or packages or any extra anything.

So let's recap, along with some answers:

S1. In your situation, when a ZFS pool loses enough vdevs or vdev members
    to cause permanent pool damage (as in completely 100% unrecoverable,
    such as losing 3 disks of a raidz2 pool), any I/O to the pool results
    in the issuing application hanging.  The system is still
    functional/usable (e.g. I/O to other pools and non-ZFS filesystems
    works fine); it's just that I/O to the now-busted pool hangs
    indefinitely.

A1. This is because of "failmode=wait" on the pool, which is the default
    property value.  This is by design; there is no ZFS "timeout" for
    this sort of thing.  "failmode=continue" is what you're looking for
    (keep reading).

S2. If the pool uses "failmode=continue", there is no change in behaviour
    (i.e. EIO is still never returned).

A2. That sounds like a bug then.  I test your claim below, and you might
    be surprised at the findings.

S3. If the previously-yanked disks are reinserted, the issue remains.

A3. What you're looking for is the "autoreplace" pool property.  However,
    on FreeBSD, this property is in effect a no-op; manual intervention
    is always required to replace a disk ("zpool replace").
    Solaris/Illumos/etc. don't have this problem because they have proper
    notification frameworks (fmd/FMA and SMF) that can make this happen.
    On FreeBSD, you could accomplish running "zpool replace"
    automatically with devd(8), but that's up to you.
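If you want to experiment with the devd(8) route, something roughly along
these lines in /etc/devd.conf is what I mean.  This is an untested,
hypothetical sketch: the pool name "array", the da[0-9]+ pattern, and the
blind "zpool online" action are all assumptions, and in practice you'd
want a smarter script deciding between "zpool online" and "zpool replace"
rather than acting on every disk arrival:

# Hypothetical sketch (untested): react to a new da(4) devfs node
# appearing and try to bring it back into the pool named "array".
# $cdev expands to the device name, e.g. "da2"; the action runs via sh.
notify 100 {
	match "system"    "DEVFS";
	match "subsystem" "CDEV";
	match "type"      "CREATE";
	match "cdev"      "da[0-9]+";
	action "zpool online array /dev/$cdev";
};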
Now let's talk about the "failmode=continue" bug/issue.  Here's a testbox
I use for testing issues with CAM, ZFS, and other bits:

root@testbox:/root # uname -a
FreeBSD testbox.home.lan 9.1-RELEASE FreeBSD 9.1-RELEASE #0 r243825: Tue Dec  4 09:23:10 UTC 2012     root@farrell.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  amd64

root@testbox:/root # zpool create array raidz2 da1 da2 da3 da4
root@testbox:/root # zpool status
  pool: array
 state: ONLINE
  scan: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	array       ONLINE       0     0     0
	  raidz2-0  ONLINE       0     0     0
	    da1     ONLINE       0     0     0
	    da2     ONLINE       0     0     0
	    da3     ONLINE       0     0     0
	    da4     ONLINE       0     0     0

errors: No known data errors

root@testbox:/root # zpool set failmode=continue array

Now in another window, I launch dd to do some gradual but continuous I/O,
and use Ctrl-T (SIGINFO) to get periodic status:

root@testbox:/root # dd if=/dev/zero of=/array/testfile bs=1
load: 0.00  cmd: dd 939 [running] 0.62r 0.00u 0.62s 5% 1508k
83348+0 records in
83347+0 records out
83347 bytes transferred in 0.620288 secs (134368 bytes/sec)

Now I physically remove da4...

root@testbox:/root # zpool status
  pool: array
 state: DEGRADED
status: One or more devices has been removed by the administrator.
	Sufficient replicas exist for the pool to continue functioning in
	a degraded state.
action: Online the device using 'zpool online' or replace the device with
	'zpool replace'.
  scan: none requested
config:

	NAME                     STATE     READ WRITE CKSUM
	array                    DEGRADED     0     0     0
	  raidz2-0               DEGRADED     0     0     0
	    da1                  ONLINE       0     0     0
	    da2                  ONLINE       0     0     0
	    da3                  ONLINE       0     0     0
	    9863791736611294808  REMOVED      0     0     0  was /dev/da4

errors: No known data errors

dd is still transferring data:

load: 0.53  cmd: dd 939 [running] 39.58r 0.55u 38.94s 100% 1512k
5792063+0 records in
5792062+0 records out
5792062 bytes transferred in 39.580059 secs (146338 bytes/sec)

Now I physically remove da3...

root@testbox:/root # zpool status
  pool: array
 state: DEGRADED
status: One or more devices has been removed by the administrator.
	Sufficient replicas exist for the pool to continue functioning in
	a degraded state.
action: Online the device using 'zpool online' or replace the device with
	'zpool replace'.
  scan: none requested
config:

	NAME                      STATE     READ WRITE CKSUM
	array                     DEGRADED     0     0     0
	  raidz2-0                DEGRADED     0     0     0
	    da1                   ONLINE       0     0     0
	    da2                   ONLINE       0     0     0
	    16564477967045696210  REMOVED      0     0     0  was /dev/da3
	    9863791736611294808   REMOVED      0     0     0  was /dev/da4

errors: No known data errors

dd is still going:

load: 0.81  cmd: dd 939 [running] 83.55r 1.28u 81.63s 100% 1512k
12537268+0 records in
12537267+0 records out
12537267 bytes transferred in 83.552147 secs (150053 bytes/sec)

Now I physically remove da2...

root@testbox:/root # zpool status
  pool: array
 state: DEGRADED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://illumos.org/msg/ZFS-8000-JQ
  scan: none requested
config:

	NAME                      STATE     READ WRITE CKSUM
	array                     DEGRADED     0    16     0
	  raidz2-0                DEGRADED     0    40     0
	    da1                   ONLINE       0     0     0
	    da2                   ONLINE       0    46     0
	    16564477967045696210  REMOVED      0     0     0  was /dev/da3
	    9863791736611294808   REMOVED      0     0     0  was /dev/da4

errors: 2 data errors, use '-v' for a list

And in the other window where dd is running, it immediately terminates
with EIO:

dd: /array/testfile: Input/output error
22475027+0 records in
22475026+0 records out
22475026 bytes transferred in 150.249338 secs (149585 bytes/sec)
root@testbox:/root #

So at this point, I can safely say that ***actively running*** processes
which are doing I/O to the pool DO get passed an EIO status.  But just
wait, the situation gets more interesting...

One thing to note (and it's important) above is that da2 is still
considered "ONLINE".  More on that in a moment.

I then decide to issue some other I/O requests to /array (such as copying
/array/testfile to /tmp), to see what the behaviour is in this state:

root@testbox:/root # ls -l /array
total 21984
-rw-r--r--  1 root  wheel  22475026 Mar 21 01:11 testfile

How this ls worked is beyond me, since the pool is effectively broken.
Possibly some of this is being pulled from the ARC or vnode caching, I
don't know.  Anyway, I decide to copy /array/testfile to /tmp to see what
happens:

root@testbox:/root # cp /array/testfile /tmp
load: 0.00  cmd: cp 959 [tx->tx_sync_done_cv)] 4.88r 0.00u 0.10s 0% 2520k
load: 0.00  cmd: cp 959 [tx->tx_sync_done_cv)] 7.02r 0.00u 0.10s 0% 2520k
^C^C^C^C^Z

Clearly you can see here that a syscall of sorts is stuck indefinitely
waiting on the kernel.  Kernel call stack for cp:

root@testbox:/root # procstat -kk 959
  PID    TID COMM             TDNAME           KSTACK
  959 100090 cp               -                mi_switch+0x186 sleepq_wait+0x42 _cv_wait+0x121 txg_wait_synced+0x85 dmu_tx_assign+0x170 zfs_inactive+0xf1 zfs_freebsd_inactive+0x1a vinactive+0x8d vputx+0x2d8 vn_close+0xa4 vn_closefile+0x5d _fdrop+0x23 closef+0x52 kern_close+0x172 amd64_syscall+0x546 Xfast_syscall+0xf7

So while this is going on, I decide to reattach da2 with the plan of
issuing "zpool replace array da2" -- sure, even though the pool is
completely horked (data loss) at this point, I figure what the hell.

Upon inserting da2, CAM and its related bits say nothing about device
insertion.  When da2 was removed, indeed there were messages.
Hmm, this sounds reminiscent of something I've seen recently (keep
reading):

root@testbox:/root # camcontrol devlist
<NECVMWar VMware IDE CDR10 1.00>   at scbus1 target 0 lun 0 (pass0,cd0)
<VMware, VMware Virtual S 1.0>     at scbus2 target 0 lun 0 (pass1,da0)
<VMware, VMware Virtual S 1.0>     at scbus2 target 1 lun 0 (pass2,da1)
<VMware, VMware Virtual S 1.0>     at scbus2 target 2 lun 0 (pass3,da2)

root@testbox:/root # ls -l /dev/da*
crw-r-----  1 root  operator    0,  88 Mar 21 00:52 /dev/da0
crw-r-----  1 root  operator    0,  94 Mar 21 00:52 /dev/da0p1
crw-r-----  1 root  operator    0,  95 Mar 21 00:52 /dev/da0p2
crw-r-----  1 root  operator    0,  96 Mar 21 00:52 /dev/da0p3
crw-r-----  1 root  operator    0,  89 Mar 21 00:52 /dev/da1

Notice no /dev/da2.  So this shouldn't come as much of a surprise:

root@testbox:/root # zpool replace array da2
cannot open 'da2': no such GEOM provider
must be a full path or shorthand device name

This would indicate a separate/different bug, probably in CAM or its
related pieces.  There were fixes for very similar situations to this in
stable/9 recently -- I know because I was the person who reported such.
mav@ and ken@ worked out a series of kinks/bugs in CAM pertaining to
pass(4) and xpt(4) and some other things.  You can read about that here:

http://lists.freebsd.org/pipermail/freebsd-fs/2013-February/016515.html
http://lists.freebsd.org/pipermail/freebsd-fs/2013-February/016524.html

To determine whether those fixes address the above oddity, I would need
to build stable/9 on this testbox.  I can do that, and will try to
dedicate some time to it tomorrow.

So in summary: there seem to be multiple issues shown above, but I can
confirm that failmode=continue **does** pass EIO to *running* processes
that are doing I/O.  Subsequent I/O, however, is questionable at this
time.

I'll end this Email with (hopefully) an educational statement: I hope my
analysis shows you why very thorough, detailed output/etc. needs to be
provided when reporting a problem, and not just some "general"
description.  This is why hard data/logs/etc. are necessary, and why
every single step of the way needs to be provided, including physical
tasks performed.

P.S. -- I started this Email at 23:15 PDT.  It's now 01:52 PDT.  To whom
should I send a bill for time rendered?  ;-)

-- 
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |