From owner-freebsd-fs@FreeBSD.ORG Mon Oct 27 23:13:44 2014
From: Robert Banz
Date: Mon, 27 Oct 2014 16:13:22 -0700
Subject: Re: ZFS errors on the array but not the disk.
To: Zaphod Beeblebrox
Cc: freebsd-fs, Steven Hartland
List-Id: Filesystems

Have you tried different hardware? This screams that something's up
somewhere in the stack -- DRAM, cabling, controller...

On Mon, Oct 27, 2014 at 11:34 AM, Zaphod Beeblebrox wrote:

> Ok... this is just frustrating. I've trusted ZFS through many versions,
> and it has pretty much delivered. There are five symptoms here:
>
> 1. After each reboot, the resilver starts again... even if I complete a
>    full scrub after the resilver.
>
> 2. Seemingly random objects (files, zvols, or snapshot items) get marked
>    as having errors. When I say random, to be clear: different items
>    each time.
>
> 3. None of the drives show errors in zpool status, nor are they chucking
>    errors into dmesg.
>
> 4. Errors are being logged against the vdev (only one of the two vdevs)
>    and against the array (half as many as the vdev).
>
> 5. The activity light of the recently replaced disk does not flash along
>    with the others in its vdev during either resilver or scrub. This
>    last bit might need some explanation: I realize that raidz1 stripes
>    do not always use all the disks, but generally the activity lights of
>    the drives in a vdev go together. In this case, the light of the
>    recently replaced drive is off much of the time.
>
> Is there anything I can/should do? I pulled the new disk, moved its
> partitions around (it's larger than the array disks, because you can't
> buy 1.5T drives anymore), and then re-added it... so I've tried that.
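For anyone following along, the pull/repartition/re-add dance described
above looks roughly like the sketch below on FreeBSD. The device name
(ada15), the label (vr2-d5), the old-device placeholder, and the partition
size are illustrative stand-ins, not taken from the poster's actual setup:

  # Hypothetical sketch: partition a larger replacement disk to match the
  # old members, then resilver onto it.
  gpart destroy -F ada15                  # wipe any old partitioning
  gpart create -s gpt ada15               # fresh GPT scheme
  gpart add -t freebsd-zfs -a 4k -s 1397G -l vr2-d5 ada15
  zpool replace vr2 <old-device> gpt/vr2-d5
  zpool status vr2                        # watch the resilver kick off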
> On Fri, Oct 24, 2014 at 11:47 PM, Zaphod Beeblebrox wrote:
>
>> Thanks for the heads up. I'm following releng/10.1, and r271683 seems
>> to be part of that, but a good catch/guess.
>>
>> On Fri, Oct 24, 2014 at 11:02 PM, Steven Hartland wrote:
>>
>>> There was an issue which would cause resilver restarts, fixed by
>>> r265253 <https://svnweb.freebsd.org/base?view=revision&revision=265253>,
>>> which was MFC'ed to stable/10 by r271683
>>> <https://svnweb.freebsd.org/base?view=revision&revision=271683>, so
>>> you'll want to make sure you're later than that.
>>>
>>> On 24/10/2014 19:42, Zaphod Beeblebrox wrote:
>>>
>>>> I manually replaced a disk... and the array was scrubbed recently.
>>>> Interestingly, I seem to be in the "endless loop" of resilvering
>>>> problem -- not much I can find on it. Resilvering will complete, and
>>>> I can then run another scrub. It will complete, too. Then rebooting
>>>> causes another resilver.
>>>>
>>>> Another odd data point: it seems as if the things that show up as
>>>> "errors" change from resilver to resilver.
>>>>
>>>> One bug, it would seem, is that once ZFS has detected an error,
>>>> another scrub can reset it, but no attempt is made to read through
>>>> the error if you access the object directly.
>>>>
>>>> On Fri, Oct 24, 2014 at 11:33 AM, Alan Somers wrote:
>>>>
>>>>> On Thu, Oct 23, 2014 at 11:37 PM, Zaphod Beeblebrox
>>>>> <zbeeble@gmail.com> wrote:
>>>>>
>>>>>> What does it mean when checksum errors appear on the array (and
>>>>>> the vdev) but not on any of the disks? See the paste below. One
>>>>>> would think there isn't some ephemeral data stored somewhere that
>>>>>> is not on one of the disks, yet "cksum" errors show only on the
>>>>>> vdev and array lines. Help?
>>>>>>
>>>>>> [2:17:316]root@virtual:/vr2/torrent/in> zpool status
>>>>>>   pool: vr2
>>>>>>  state: ONLINE
>>>>>> status: One or more devices is currently being resilvered.  The
>>>>>>         pool will continue to function, possibly in a degraded
>>>>>>         state.
>>>>>> action: Wait for the resilver to complete.
>>>>>>   scan: resilver in progress since Thu Oct 23 23:11:29 2014
>>>>>>         1.53T scanned out of 22.6T at 62.4M/s, 98h23m to go
>>>>>>         119G resilvered, 6.79% done
>>>>>> config:
>>>>>>
>>>>>>   NAME               STATE     READ WRITE CKSUM
>>>>>>   vr2                ONLINE       0     0    36
>>>>>>     raidz1-0         ONLINE       0     0    72
>>>>>>       label/vr2-d0   ONLINE       0     0     0
>>>>>>       label/vr2-d1   ONLINE       0     0     0
>>>>>>       gpt/vr2-d2c    ONLINE       0     0     0  block size: 512B configured, 4096B native  (resilvering)
>>>>>>       gpt/vr2-d3b    ONLINE       0     0     0  block size: 512B configured, 4096B native
>>>>>>       gpt/vr2-d4a    ONLINE       0     0     0  block size: 512B configured, 4096B native
>>>>>>       ada14          ONLINE       0     0     0
>>>>>>       label/vr2-d6   ONLINE       0     0     0
>>>>>>       label/vr2-d7c  ONLINE       0     0     0
>>>>>>       label/vr2-d8   ONLINE       0     0     0
>>>>>>     raidz1-1         ONLINE       0     0     0
>>>>>>       gpt/vr2-e0     ONLINE       0     0     0  block size: 512B configured, 4096B native
>>>>>>       gpt/vr2-e1     ONLINE       0     0     0  block size: 512B configured, 4096B native
>>>>>>       gpt/vr2-e2     ONLINE       0     0     0  block size: 512B configured, 4096B native
>>>>>>       gpt/vr2-e3     ONLINE       0     0     0
>>>>>>       gpt/vr2-e4     ONLINE       0     0     0  block size: 512B configured, 4096B native
>>>>>>       gpt/vr2-e5     ONLINE       0     0     0  block size: 512B configured, 4096B native
>>>>>>       gpt/vr2-e6     ONLINE       0     0     0  block size: 512B configured, 4096B native
>>>>>>       gpt/vr2-e7     ONLINE       0     0     0  block size: 512B configured, 4096B native
>>>>>>
>>>>>> errors: 43 data errors, use '-v' for a list
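An aside on the paste above: the "block size: 512B configured, 4096B
native" notes mean those members were added with ashift=9 (512-byte
allocation) on 4K-sector drives. To confirm a pool's ashift on FreeBSD,
something like the following should work; the pool name is taken from the
paste, and the exact output layout is illustrative:

  zdb -C vr2 | grep ashift     # ashift: 9 = 512B, ashift: 12 = 4KB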
>>>>> The checksum errors will appear on the raidz vdev instead of a leaf
>>>>> if vdev_raidz.c can't determine which leaf vdev was responsible.
>>>>> This could happen if two or more leaf vdevs return bad data for the
>>>>> same block, which would also lead to unrecoverable data errors. I
>>>>> see that you have some unrecoverable data errors, so maybe that's
>>>>> what happened to you.
>>>>>
>>>>> Subtle design bugs in ZFS can also lead to vdev_raidz.c being unable
>>>>> to determine which child was responsible for a checksum error.
>>>>> However, I've only seen that happen when a raidz vdev has a mirror
>>>>> child, and that can only happen if the child is a spare or replacing
>>>>> vdev. Did you activate any spares, or did you manually replace a
>>>>> vdev?
>>>>>
>>>>> -Alan

_______________________________________________
freebsd-fs@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-fs
To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"
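A closing aside to Alan's advice: the damaged objects can be listed, the
counters zeroed, and the pool re-verified. A minimal sketch of that
workflow, using the pool name from the thread; note that zpool clear only
resets the error counters, it repairs nothing:

  zpool status -v vr2    # list the 43 damaged files/zvols/snapshots
  zpool clear vr2        # zero the READ/WRITE/CKSUM counters
  zpool scrub vr2        # re-verify; errors that come back are real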