From owner-freebsd-fs@FreeBSD.ORG Mon Oct 27 23:13:44 2014
From: Robert Banz
Date: Mon, 27 Oct 2014 16:13:22 -0700
Subject: Re: ZFS errors on the array but not the disk.
To: Zaphod Beeblebrox
Cc: freebsd-fs, Steven Hartland
List-Id: Filesystems

Have you tried different hardware? This screams that something's up
somewhere in the stack -- DRAM, cabling, controller...

On Mon, Oct 27, 2014 at 11:34 AM, Zaphod Beeblebrox wrote:

> Ok... this is just frustrating. I've trusted ZFS through many versions,
> and it has pretty much delivered. There are five symptoms here:
>
> 1. After each reboot, the resilver starts again... even if I complete a
>    full scrub after the resilver.
>
> 2. Seemingly random objects (files, zvols, or snapshot items) get marked
>    as having errors. When I say random, to be clear: different items
>    each time.
>
> 3. None of the drives show errors in zpool status, nor are they chucking
>    errors into dmesg.
>
> 4. Errors are being logged against the vdev (only one of the two vdevs)
>    and against the array (half as many as the vdev).
>
> 5. The activity light of the recently replaced disk does not flash along
>    with the others in its vdev during either resilver or scrub. This
>    last bit might need some explanation: I realize that raidz1 stripes
>    do not always use all the disks, but generally the activity lights of
>    the drives in a vdev go together. In this case, the light of the
>    recently replaced drive is off much of the time.
>
> Is there anything I can/should do? I pulled the new disk, moved its
> partitions around (it's larger than the array disks, because you can't
> buy 1.5T drives anymore), and then re-added it... so I've tried that.
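For anyone following along, the pull/repartition/re-add dance described
above looks roughly like the sketch below on FreeBSD. The device name
(ada15), the label (vr2-d5), the old-device placeholder, and the partition
size are illustrative stand-ins, not taken from the poster's actual setup:

  # Hypothetical sketch: partition a larger replacement disk to match the
  # old members, then resilver onto it.
  gpart destroy -F ada15                  # wipe any old partitioning
  gpart create -s gpt ada15               # fresh GPT scheme
  gpart add -t freebsd-zfs -a 4k -s 1397G -l vr2-d5 ada15
  zpool replace vr2 <old-device> gpt/vr2-d5
  zpool status vr2                        # watch the resilver kick off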
> On Fri, Oct 24, 2014 at 11:47 PM, Zaphod Beeblebrox wrote:
>
>> Thanks for the heads up. I'm following releng/10.1, and r271683 seems
>> to be part of that, but a good catch/guess.
>>
>> On Fri, Oct 24, 2014 at 11:02 PM, Steven Hartland wrote:
>>
>>> There was an issue which would cause resilver restarts, fixed by
>>> r265253 <https://svnweb.freebsd.org/base?view=revision&revision=265253>,
>>> which was MFC'ed to stable/10 by r271683
>>> <https://svnweb.freebsd.org/base?view=revision&revision=271683>, so
>>> you'll want to make sure you're later than that.
>>>
>>> On 24/10/2014 19:42, Zaphod Beeblebrox wrote:
>>>
>>>> I manually replaced a disk... and the array was scrubbed recently.
>>>> Interestingly, I seem to be in the "endless loop" of resilvering
>>>> problem -- not much I can find on it. Resilvering will complete, and
>>>> I can then run another scrub. It will complete, too. Then rebooting
>>>> causes another resilver.
>>>>
>>>> Another odd data point: it seems as if the things that show up as
>>>> "errors" change from resilver to resilver.
>>>>
>>>> One bug, it would seem, is that once ZFS has detected an error,
>>>> another scrub can reset it, but no attempt is made to read through
>>>> the error if you access the object directly.
>>>>
>>>> On Fri, Oct 24, 2014 at 11:33 AM, Alan Somers wrote:
>>>>
>>>>> On Thu, Oct 23, 2014 at 11:37 PM, Zaphod Beeblebrox
>>>>> <zbeeble@gmail.com> wrote:
>>>>>
>>>>>> What does it mean when checksum errors appear on the array (and
>>>>>> the vdev) but not on any of the disks? See the paste below. One
>>>>>> would think there isn't some ephemeral data stored somewhere that
>>>>>> is not on one of the disks, yet "cksum" errors show only on the
>>>>>> vdev and array lines. Help?
>>>>>>
>>>>>> [2:17:316]root@virtual:/vr2/torrent/in> zpool status
>>>>>>   pool: vr2
>>>>>>  state: ONLINE
>>>>>> status: One or more devices is currently being resilvered.  The
>>>>>>         pool will continue to function, possibly in a degraded
>>>>>>         state.
>>>>>> action: Wait for the resilver to complete.
>>>>>>   scan: resilver in progress since Thu Oct 23 23:11:29 2014
>>>>>>         1.53T scanned out of 22.6T at 62.4M/s, 98h23m to go
>>>>>>         119G resilvered, 6.79% done
>>>>>> config:
>>>>>>
>>>>>>   NAME               STATE     READ WRITE CKSUM
>>>>>>   vr2                ONLINE       0     0    36
>>>>>>     raidz1-0         ONLINE       0     0    72
>>>>>>       label/vr2-d0   ONLINE       0     0     0
>>>>>>       label/vr2-d1   ONLINE       0     0     0
>>>>>>       gpt/vr2-d2c    ONLINE       0     0     0  block size: 512B configured, 4096B native  (resilvering)
>>>>>>       gpt/vr2-d3b    ONLINE       0     0     0  block size: 512B configured, 4096B native
>>>>>>       gpt/vr2-d4a    ONLINE       0     0     0  block size: 512B configured, 4096B native
>>>>>>       ada14          ONLINE       0     0     0
>>>>>>       label/vr2-d6   ONLINE       0     0     0
>>>>>>       label/vr2-d7c  ONLINE       0     0     0
>>>>>>       label/vr2-d8   ONLINE       0     0     0
>>>>>>     raidz1-1         ONLINE       0     0     0
>>>>>>       gpt/vr2-e0     ONLINE       0     0     0  block size: 512B configured, 4096B native
>>>>>>       gpt/vr2-e1     ONLINE       0     0     0  block size: 512B configured, 4096B native
>>>>>>       gpt/vr2-e2     ONLINE       0     0     0  block size: 512B configured, 4096B native
>>>>>>       gpt/vr2-e3     ONLINE       0     0     0
>>>>>>       gpt/vr2-e4     ONLINE       0     0     0  block size: 512B configured, 4096B native
>>>>>>       gpt/vr2-e5     ONLINE       0     0     0  block size: 512B configured, 4096B native
>>>>>>       gpt/vr2-e6     ONLINE       0     0     0  block size: 512B configured, 4096B native
>>>>>>       gpt/vr2-e7     ONLINE       0     0     0  block size: 512B configured, 4096B native
>>>>>>
>>>>>> errors: 43 data errors, use '-v' for a list
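An aside on the paste above: the "block size: 512B configured, 4096B
native" notes mean those members were added with ashift=9 (512-byte
allocation) on 4K-sector drives. To confirm a pool's ashift on FreeBSD,
something like the following should work; the pool name is taken from the
paste, and the exact output layout is illustrative:

  zdb -C vr2 | grep ashift     # ashift: 9 = 512B, ashift: 12 = 4KB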
>>>>> The checksum errors will appear on the raidz vdev instead of a leaf
>>>>> if vdev_raidz.c can't determine which leaf vdev was responsible.
>>>>> This could happen if two or more leaf vdevs return bad data for the
>>>>> same block, which would also lead to unrecoverable data errors. I
>>>>> see that you have some unrecoverable data errors, so maybe that's
>>>>> what happened to you.
>>>>>
>>>>> Subtle design bugs in ZFS can also lead to vdev_raidz.c being unable
>>>>> to determine which child was responsible for a checksum error.
>>>>> However, I've only seen that happen when a raidz vdev has a mirror
>>>>> child, and that can only happen if the child is a spare or replacing
>>>>> vdev. Did you activate any spares, or did you manually replace a
>>>>> vdev?
>>>>>
>>>>> -Alan

_______________________________________________
freebsd-fs@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-fs
To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"
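A closing aside to Alan's advice: the damaged objects can be listed, the
counters zeroed, and the pool re-verified. A minimal sketch of that
workflow, using the pool name from the thread; note that zpool clear only
resets the error counters, it repairs nothing:

  zpool status -v vr2    # list the 43 damaged files/zvols/snapshots
  zpool clear vr2        # zero the READ/WRITE/CKSUM counters
  zpool scrub vr2        # re-verify; errors that come back are real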