Date: Thu, 18 Apr 2013 18:21:18 GMT
From: Nathaniel Filardo <nwf@cs.jhu.edu>
To: freebsd-gnats-submit@FreeBSD.org
Subject: misc/177966: [zfs] resilver completes but subsequent scrub reports errors
Message-ID: <201304181821.r3IILIWr050489@red.freebsd.org>
Resent-Message-ID: <201304181830.r3IIU2jg005800@freefall.freebsd.org>

>Number:         177966
>Category:       misc
>Synopsis:       [zfs] resilver completes but subsequent scrub reports errors
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    freebsd-bugs
>State:          open
>Quarter:
>Keywords:
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Thu Apr 18 18:30:00 UTC 2013
>Closed-Date:
>Last-Modified:
>Originator:     Nathaniel Filardo
>Release:        9.1-STABLE
>Organization:
>Environment:
FreeBSD hydra.priv.oc.ietfng.org 9.1-STABLE FreeBSD 9.1-STABLE #39 r+39eb5ca-dirty: Fri Apr 5 10:46:04 EDT 2013 root@hydra.priv.oc.ietfng.org:/usr/obj/systank/src-git/sys/NWFKERN sparc64
>Description:
I took one disk out of a raidz2 pool, and proceeded to run the system for a while on a degraded configuration (but still with redundancy). I then replaced the missing disk (with zpool replace rather than zpool online) and let the system run resilver to completion. It succeeded and reported no errors.

Having had bad experiences in the past (http://lists.freebsd.org/pipermail/freebsd-fs/2013-March/016627.html) I ran scrub, which reported 11 checksum errors on the replaced drive, very clearly during the part of the scrub which was walking refcount > 1 blocks. I am currently running another scrub pass, which I hypothesize will succeed without error.

The pool, under normal circumstances, looks like this:

        NAME        STATE     READ WRITE CKSUM
        tank0       ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            ada6    ONLINE       0     0     0
            ada7    ONLINE       0     0     0
            ada9    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada8    ONLINE       0     0     0
        cache
          ada1a     ONLINE       0     0     0
          ada0b     ONLINE       0     0     0

The pool configuration is pretty default, except that it uses 4K sectors (ashift=12) and the following options are set:

tank0  checksum     sha256         received
tank0  compression  gzip           received
tank0  atime        off            received
tank0  dedup        sha256,verify  received

The deduplication table is pretty sizable:

dedup: DDT entries 16754758, size 981 on disk, 158 in core

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    13.0M   1.33T   1.24T   1.27T    13.0M   1.33T   1.24T   1.27T
     2    2.35M    198G    165G    172G    5.18M    430G    361G    378G
     4     495K   25.4G   13.4G   16.1G    2.24M    114G   61.0G   73.8G
     8     121K   1.60G    689M   1.48G    1.28M   16.3G   6.78G   15.5G
    16    22.1K    250M    116M    269M     469K   5.04G   2.31G   5.48G
    32    4.11K    157M    138M    159M     195K   8.45G   7.65G   8.59G
    64    1.53K   9.76M   3.99M   14.8M     124K    897M    375M   1.22G
   128      254   6.49M   2.89M   4.60M    41.8K    949M    427M    717M
   256       58    582K    100K    519K    19.6K    181M   34.3M    175M
   512       27    540K     26K    232K    19.0K    482M   20.7M    167M
    1K       12      6K      6K   95.9K    17.9K   8.94M   8.94M    143M
    2K        8    648K   13.5K   71.9K    19.9K   1.42G   34.4M    181M
    4K        3    256K    129K    144K    17.6K   1.38G    764M    851M
    8K       12    644K   8.50K   95.9K     149K   8.97G    110M   1.16G
 Total    16.0M   1.55T   1.42T   1.45T    22.7M   1.90T   1.67T   1.74T

Full DSL scans (scrub, resilver) take about 48 hours each, the first half of which is spent in an incredibly annoyingly slow scan (currently moving about 20 iops/sec and 1Mb/sec) as it works its way through the DDT entries with refcount > 1, after which it ramps up to 35MB/sec as it traverses refcount=1 blocks in disk order.

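For a sense of scale, the DDT histogram above can be reduced to two summary figures: the overall dedup ratio (referenced DSIZE over allocated DSIZE) and the share of DDT entries with refcount > 1, which is the set of entries the slow first scan phase walks. The following is only a rough sketch using the rounded values printed in the table; the variable and function names are illustrative and not part of any ZFS tool.

```python
# Rounded figures copied from the DDT histogram in this report.
# Block counts are in millions of entries; DSIZE values use the table's "T" unit.
ALLOCATED_BLOCKS_M = 16.0   # "Total" row, allocated blocks
UNIQUE_BLOCKS_M = 13.0      # refcnt=1 row, allocated blocks
ALLOCATED_DSIZE_T = 1.45    # "Total" row, allocated DSIZE
REFERENCED_DSIZE_T = 1.74   # "Total" row, referenced DSIZE


def dedup_ratio(referenced: float, allocated: float) -> float:
    """Space referenced by datasets divided by space actually allocated."""
    return referenced / allocated


def shared_fraction(total_blocks: float, unique_blocks: float) -> float:
    """Fraction of DDT entries whose refcount is greater than 1."""
    return (total_blocks - unique_blocks) / total_blocks


ratio = dedup_ratio(REFERENCED_DSIZE_T, ALLOCATED_DSIZE_T)
shared = shared_fraction(ALLOCATED_BLOCKS_M, UNIQUE_BLOCKS_M)

print(f"dedup ratio: about {ratio:.2f}x")
print(f"DDT entries with refcount > 1: about {shared:.1%}")
```

With these rounded inputs the ratio comes out at roughly 1.2x, and a bit under a fifth of the DDT entries are shared; the exact values depend on the precision lost in the table's rounding.
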
In any case, the scrub after the resilver was clearly in the first such phase of its scan and reported 11 checksum errors all at once (more or less). There were no checksum errors found in the second (refcount=1) phase.

If I have to guess, this is possibly a bug in the code which handles entries in the DDT changing their class while a scrub is in progress.
>How-To-Repeat:
It appears sufficient to be performing I/O traffic to a resilvering pool with deduplication. I will attempt to repeat the experiment as soon as this scrub pass finishes successfully; if it instead finds errors, I will run scrub again.
>Fix:
>Release-Note:
>Audit-Trail:
>Unformatted: