From: Zaphod Beeblebrox <zbeeble@gmail.com>
Date: Tue, 9 Apr 2019 17:27:29 -0400
To: Karl Denninger
Cc: FreeBSD Stable <freebsd-stable@freebsd.org>
Subject: Re: Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)
I have a "Ghetto" home RAID array. It's built on compromises and relies on
RAID-Z2 to survive. It consists of two plexes of 8x 4T units of "spinning
rust". It's been upgraded and upgraded: it started as 8x 2T, then 8x 2T +
8x 4T, and is now 16x 4T.

The first 8 disks are connected to motherboard SATA. IIRC, there are 10
ports; two are used for a mirror that it boots from. There's also an SSD in
there somehow, so it might be 12 ports on the motherboard. The other 8
disks started life in eSATA port multiplier boxes. That was
doubleplusungood, so I got an LSI-based RAID card pulled from a Fujitsu
server in Japan. That's been upgraded a couple of times... not always a
good experience. One problem is that cheap or refurbished drives don't
always "like" SAS controllers and FreeBSD. YMMV.

Anyway, this is all to introduce the fact that I've seen this behaviour
multiple times: a drive leaves the array for some amount of time, and after
resilvering, a scrub will find a small amount of bad data -- 32k or 40k or
some such. In my cranial schema of things, I've chalked it up to
out-of-order writes by the drives, or other such behaviour such that ZFS
doesn't know exactly what has been written. I've often wondered if the fix
would be to add an amount of fuzz to the transaction range that is
resilvered.
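To be concrete, the kind of sequence I'm describing looks roughly like
this (pool and disk names are made up -- substitute your own):

    # take a disk out of the pool for a while
    zpool offline tank da5
    # ... hours or days of normal pool activity ...
    # bring it back and let it resilver
    zpool online tank da5
    zpool status tank          # wait for the resilver to finish, 0 errors
    # and yet a scrub afterwards still repairs a few checksums
    zpool scrub tank
    zpool status -v tank       # shows e.g. 32k or 40k repaired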
On Tue, Apr 9, 2019 at 4:32 PM Karl Denninger wrote:

> On 4/9/2019 15:04, Andriy Gapon wrote:
> > On 09/04/2019 22:01, Karl Denninger wrote:
> >> the resilver JUST COMPLETED with no errors which means the ENTIRE
> >> DISK'S IN USE AREA was examined, compared, and blocks not on the
> >> "new member" or changed copied over.
> >
> > I think that that's not entirely correct.
> >
> > ZFS maintains something called DTL, a dirty-time log, for a missing /
> > offlined / removed device. When the device re-appears and gets
> > resilvered, ZFS walks only those blocks that were born within the TXG
> > range(s) when the device was missing.
> >
> > In any case, I do not have an explanation for what you are seeing.
>
> That implies something much more serious could be wrong: that, given
> enough time -- a week, say -- the DTL marker is incorrect and some TXGs
> that were in fact changed since the OFFLINE are not walked through and
> synchronized. That would explain why it gets caught by a scrub -- the
> resilver is in fact not copying all the blocks that got changed, so when
> you scrub, the blocks are not identical. Assuming the detached disk is
> consistent, that's not catastrophically bad IF CAUGHT; where you'd get
> screwed HARD is the situation where (for example) you had a 2-unit
> mirror, detached one, re-attached it, the resilver says all is well, and
> then the *non-detached* disk fails before any scrub is performed. In
> that case you will have permanently destroyed or corrupted data, since
> the surviving disk is allegedly consistent but there are blocks
> *missing* that were never copied over.
>
> Again, this just showed up on 12.x; it definitely was *not* at issue in
> 11.1 at all. I never ran 11.2 in production for a material amount of
> time (I went from 11.1 to 12.0-STABLE after the IPv6 fixes were posted
> to 12.x), so I don't know whether it is in play on 11.2 or not.
>
> I'll see if it shows up again with 20.00.07.00 card firmware.
>
> Of note, I cannot reproduce this on my test box with EITHER 19.00.00.00
> or 20.00.07.00 firmware when I set up a 3-unit mirror, offline one, make
> a crap-ton of changes, offline the second and reattach the third (in
> effect mirroring the "take one to the vault" thing) with a couple of
> hours of elapsed time and a synthetic (e.g. "dd if=/dev/random
> of=outfile bs=1m" sort of thing) "make me some new data that has to be
> resilvered" workload. I don't know if that's because I need more entropy
> in the filesystem than I can reasonably generate this way (e.g. more
> fragmentation of files, etc.) or whether it's a time-based issue (e.g.
> something's wrong with the DTL/TXG mechanism as you note above and it
> only happens if the elapsed time causes something to be subject to a
> rollover or similar problem.)
>
> I spent quite a lot of time trying to reproduce the issue on my
> "sandbox" machine and was unable to -- and of note, it is never a large
> quantity of data that is impacted; it's usually only a couple of dozen
> checksums that show as bad and fixed. Of note, it's also never just one;
> if there were a single random hit on a data block due to ordinary bitrot
> I'd expect only one checksum to be bad. But generating a realistic
> synthetic workload over the amount of time involved on a sandbox is not
> trivial at all; the system on which this is now happening handles a lot
> of email and routine processing of various sorts, including a fair bit
> of database activity associated with network monitoring and statistical
> analysis.
>
> I'm assuming that using "offline" as the means to do this hasn't somehow
> become "invalid" as an accepted way of doing this sort of thing... it
> certainly has worked perfectly well for a very long time!
>
> --
> Karl Denninger
> karl@denninger.net
> /The Market Ticker/
> /[S/MIME encrypted email preferred]/
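For anyone who wants to poke at this, the test Karl describes above boils
down to something like the following sketch (pool and disk names are
invented; the elapsed time and the amount of churn are the variables to
play with):

    # three-way mirror on scratch disks
    zpool create test mirror da1 da2 da3
    # take the "vault" disk offline
    zpool offline test da3
    # hours of synthetic churn to create data that has to be resilvered
    dd if=/dev/random of=/test/junk bs=1m count=10240
    # rotate: offline the second disk, reattach the vault disk
    zpool offline test da2
    zpool online test da3
    # after the resilver reports clean, see if a scrub still finds errors
    zpool scrub test
    zpool status -v test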