From owner-freebsd-stable@FreeBSD.ORG  Fri Jun  5 08:44:26 2009
Delivered-To: freebsd-stable@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 085321065673;
	Fri, 5 Jun 2009 08:44:26 +0000 (UTC)
	(envelope-from uqs@spoerlein.net)
Received: from acme.spoerlein.net (cl-43.dus-01.de.sixxs.net [IPv6:2a01:198:200:2a::2])
	by mx1.freebsd.org (Postfix) with ESMTP id 786368FC19;
	Fri, 5 Jun 2009 08:44:25 +0000 (UTC)
	(envelope-from uqs@spoerlein.net)
Received: from acme.spoerlein.net (localhost.spoerlein.net [127.0.0.1])
	by acme.spoerlein.net (8.14.3/8.14.3) with ESMTP id n558iNSm015007
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Fri, 5 Jun 2009 10:44:23 +0200 (CEST)
	(envelope-from uqs@spoerlein.net)
Received: (from uqs@localhost)
	by acme.spoerlein.net (8.14.3/8.14.3/Submit) id n558iNK0015006;
	Fri, 5 Jun 2009 10:44:23 +0200 (CEST)
	(envelope-from uqs@spoerlein.net)
Date: Fri, 5 Jun 2009 10:44:23 +0200
From: Ulrich Spörlein <uqs@spoerlein.net>
To: Kip Macy
Message-ID: <20090605084423.GA1609@acme.spoerlein.net>
Mail-Followup-To: Kip Macy, freebsd-stable@freebsd.org
References: <20090602091610.GE93344@acme.spoerlein.net>
	<20090602092408.GF93344@acme.spoerlein.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <20090602092408.GF93344@acme.spoerlein.net>
User-Agent: Mutt/1.5.19 (2009-01-05)
Cc: freebsd-stable@FreeBSD.org
Subject: Re: ZFS weird device tasting loop since MFC
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code
X-List-Received-Date: Fri, 05 Jun 2009 08:44:26 -0000

On Tue, 02.06.2009 at 11:24:08 +0200, Ulrich Spörlein wrote:
> On Tue, 02.06.2009 at 11:16:10 +0200, Ulrich Spörlein wrote:
> > Hi all,
> >
> > so I went ahead and updated my ~7.2 file server to the new ZFS
> > goodness, and before running any further tests, I already discovered
> > something weird and annoying.
> >
> > I'm using a mirror on GELI, where one disk is usually *not* attached
> > as a means of poor man's backup. (I had to go that route, as
> > send/recv of snapshots frequently deadlocked the system, whereas a
> > mirror scrubbing did not)
> >
> > root@coyote:~# zpool status
> >   pool: tank
> >  state: DEGRADED
> > status: The pool is formatted using an older on-disk format.  The
> >         pool can still be used, but some features are unavailable.
> > action: Upgrade the pool using 'zpool upgrade'.  Once this is done,
> >         the pool will no longer be accessible on older software
> >         versions.
> >  scrub: none requested
> > config:
> >
> >         NAME                      STATE     READ WRITE CKSUM
> >         tank                      DEGRADED     0     0     0
> >           mirror                  DEGRADED     0     0     0
> >             ad4.eli               ONLINE       0     0     0
> >             12333765091756463941  REMOVED      0     0     0  was /dev/da0.eli
> >
> > errors: No known data errors
> >
> > When imported, there is a constant "tasting" of all devices in the
> > system, which also makes the floppy drive go spinning constantly,
> > which is really annoying. It did not do this with the old ZFS, are
> > there any remedies?
> >
> > gstat(8) is displaying the following every other second, together
> > with a spinning fd0 drive.
> >
> > dT: 1.010s  w: 1.000s  filter: ^...$
> >  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
> >     0      0      0      0    0.0      0      0    0.0    0.0| fd0
> >     0      8      8   1014    0.1      0      0    0.0    0.1| md0
> >     0     32     32   4055    9.2      0      0    0.0   29.2| ad0
> >     0     77     10   1267    7.1     63   1125    2.3   31.8| ad4
> >
> > There is no activity going on, especially md0 is for /tmp, yet it
> > constantly tries to read stuff from everywhere. I will now insert
> > the second drive and see if ZFS shuts up then ...
>
> It does, but it also did not start resilvering the second disk:
>
> root@coyote:~# zpool status
>   pool: tank
>  state: ONLINE
> status: One or more devices has experienced an unrecoverable error.
>         An attempt was made to correct the error. Applications are
>         unaffected.
> action: Determine if the device needs to be replaced, and clear the
>         errors using 'zpool clear' or replace the device with 'zpool
>         replace'.
>    see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: none requested
> config:
>
>         NAME         STATE     READ WRITE CKSUM
>         tank         ONLINE       0     0     0
>           mirror     ONLINE       0     0     0
>             ad4.eli  ONLINE       0     0     0
>             da0.eli  ONLINE       0     0    16
>
> errors: No known data errors
>
> Will now run the scrub and report back in 6-9h.

Another datapoint: While the floppy-tasting has stopped, since the
mirror sees all devices again, there is some other problem here:

root@coyote:/# zpool online tank da0.eli
root@coyote:/# zpool status
  pool: tank
 state: ONLINE
 scrub: resilver completed after 0h0m with 0 errors on Fri Jun  5 10:21:36 2009
config:

        NAME         STATE     READ WRITE CKSUM
        tank         ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            ad4.eli  ONLINE       0     0     0  684K resilvered
            da0.eli  ONLINE       0     0     0  2.20M resilvered

errors: No known data errors

root@coyote:/# zpool offline tank da0.eli
root@coyote:/# zpool status
  pool: tank
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning
        in a degraded state.
action: Online the device using 'zpool online' or replace the device
        with 'zpool replace'.
 scrub: resilver completed after 0h0m with 0 errors on Fri Jun  5 10:21:36 2009
config:

        NAME         STATE     READ WRITE CKSUM
        tank         DEGRADED     0     0     0
          mirror     DEGRADED     0     0     0
            ad4.eli  ONLINE       0     0     0  684K resilvered
            da0.eli  OFFLINE      0     0     0  2.20M resilvered

errors: No known data errors

root@coyote:/# zpool status
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are
        unaffected.
action: Determine if the device needs to be replaced, and clear the
        errors using 'zpool clear' or replace the device with 'zpool
        replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 0h0m with 0 errors on Fri Jun  5 10:21:36 2009
config:

        NAME         STATE     READ WRITE CKSUM
        tank         DEGRADED     0     0     0
          mirror     DEGRADED     0     0     0
            ad4.eli  ONLINE       0     0     0  684K resilvered
            da0.eli  OFFLINE      0   339     0  2.20M resilvered

errors: No known data errors

root@coyote:/# zpool status
  pool: tank
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning
        in a degraded state.
action: Online the device using 'zpool online' or replace the device
        with 'zpool replace'.
 scrub: resilver completed after 0h0m with 0 errors on Fri Jun  5 10:21:36 2009
config:

        NAME         STATE     READ WRITE CKSUM
        tank         DEGRADED     0     0     0
          mirror     DEGRADED     0     0     0
            ad4.eli  ONLINE       0     0     0  684K resilvered
            da0.eli  OFFLINE      0     0     0  2.20M resilvered

errors: No known data errors

So I ran 'zpool status' thrice after the offline, and the second one
reports write errors on the OFFLINE device (WTF?). Running zpool status
in a loop, this will constantly show up and then vanish again.

I also get constant write requests to the remaining device, even though
no applications are accessing it. What the hell is ZFS trying to do
here?

root@coyote:/# zpool iostat 1
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank         883G  48.4G      8    246  56.8K  1.53M
tank         883G  48.4G      8    249  55.9K  1.55M
tank         883G  48.4G      8    250  55.0K  1.54M
tank         883G  48.4G      8    252  54.1K  1.56M
tank         883G  48.4G      8    254  53.3K  1.57M
tank         883G  48.4G      8    253  52.5K  1.56M
tank         883G  48.4G      7    255  51.7K  1.57M
^C

Again, WTF? Can someone please enlighten me here?

Cheers,
Ulrich Spörlein

-- 
http://www.dubistterrorist.de/
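[Editor's note: the "poor man's backup" cycle described in the quoted
message (attach the normally-absent GELI-encrypted mirror half, let ZFS
resilver it, then take it offline again) could be sketched roughly as
below. This is a hypothetical sketch, not the poster's actual script:
pool and device names (tank, da0, da0.eli) are taken from the thread,
GELI keyfile/passphrase handling is omitted, and the grep string used to
poll for resilver completion is an assumption about the zpool status
wording of that era.]

```shell
#!/bin/sh
# Sketch of the detach/attach mirror-backup cycle from the thread.
# DRYRUN=1 (the default here) only prints the commands that would run,
# so the sequence can be inspected without touching any pool.
DRYRUN=${DRYRUN:-1}

run() {
    if [ "$DRYRUN" -eq 1 ]; then
        echo "$@"                 # show what would be executed
    else
        "$@"
    fi
}

POOL=tank
DEV=da0                           # the usually-detached mirror half

run geli attach /dev/$DEV         # creates /dev/da0.eli; normally prompts for a passphrase
run zpool online $POOL $DEV.eli   # re-add it to the mirror; resilvering starts
# Poll until zpool status no longer reports a resilver in progress
run sh -c "while zpool status $POOL | grep -q 'resilver in progress'; do sleep 60; done"
run zpool offline $POOL $DEV.eli  # take the backup half out of the pool again
run geli detach $DEV.eli          # and tear down the GELI device
```

With DRYRUN=0 the commands would actually execute; as the thread shows,
though, on that system the pool kept issuing writes and tasting devices
even after the offline, so the quiet steady-state this sketch assumes
did not hold there.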