Date:      Fri, 22 Jun 2012 09:48:10 +0200 (CEST)
From:      dirk.meyer@dinoex.sub.org
To:        FreeBSD-gnats-submit@FreeBSD.org
Subject:   kern/169319: zfs resilver can't complete
Message-ID:  <201206220748.q5M7mAai051853@home8.dinoex.sub.de>
Resent-Message-ID: <201206220750.q5M7o2Gt027614@freefall.freebsd.org>


>Number:         169319
>Category:       kern
>Synopsis:       zfs resilver can't complete
>Confidential:   no
>Severity:       critical
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Fri Jun 22 07:50:02 UTC 2012
>Closed-Date:
>Last-Modified:
>Originator:     Dirk Meyer
>Release:        FreeBSD 9.0-RELEASE
>Organization:
privat
>Environment:

	FreeBSD 9.0-RELEASE-p1	GENERIC and GENERIC + DTRACE	
	ZFS raidz

>Description:

The pool "tank" was created with 3 disks.
One disk was replaced and the resilver completed.
The pool was then extended with 3 new disks.
Another disk was replaced and this resilver does not complete;
it keeps restarting roughly every 3 minutes.
There is no entry in dmesg or syslog.
smartctl -a on all disks reports no problems.
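
The restarts are easiest to observe by polling the pool. A small loop like
the following (a sketch; the pool name "tank" is taken from this report)
should show the "scanned" counter in zpool status falling back to a few MB
every few minutes, while zpool history -i keeps growing new
[internal pool scrub txg:...] / [internal pool scrub done ...] complete=0 pairs:

    while :; do
        date
        zpool status tank | grep scanned
        zpool history -i tank | tail -n 2
        sleep 180
    done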

This problem has been reported multiple times:
http://forums.freebsd.org/showthread.php?t=27878
http://lists.freebsd.org/pipermail/freebsd-fs/2008-December/005532.html
http://kerneltrap.org/mailarchive/freebsd-fs/2008/12/27/4498564
http://forums.freenas.org/showthread.php?5008-ZFS-raidz1-resilver-restarts-after-1-min
http://comments.gmane.org/gmane.os.solaris.opensolaris.zfs/42740
http://comments.gmane.org/gmane.os.solaris.opensolaris.zfs/49144



 zpool status
  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scan: resilver in progress since Sun Jun 17 22:17:19 2012
    52,6M scanned out of 3,38T at 1,64M/s, 599h21m to go
    7,27M resilvered, 0,00% done
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        DEGRADED     0     0     0
          raidz1-0                  ONLINE       0     0     0
            ada5p1                  ONLINE       0     0     0
            ada4p1                  ONLINE       0     0     0
            ada3p1                  ONLINE       0     0     0
          raidz1-1                  DEGRADED     0     0     0
            ada1p1                  ONLINE       0     0     0
            ada0p1                  ONLINE       0     0     0
            replacing-2             UNAVAIL      0     0     0
              16718347834481227056  UNAVAIL      0     0     0  was /dev/ada6p1/old
              ada6p1                ONLINE       0     0     0  (resilvering)

errors: No known data errors

>How-To-Repeat:

$ zpool history -i
2011-10-30.06:33:44 zpool create tank raidz1 ada1p1 ada2p1 ada3p1
(disk failed)
2011-11-29.09:38:28 zpool replace tank ada2p1 ada2p1
(reordering of devices here)
2012-01-27.11:21:50 zpool add tank raidz1 ada0p1 ada1p1 ada3p1
(disk failed)
2012-06-16.11:44:50 zpool scrub tank
2012-06-17.22:17:32 zpool replace tank ada6p1 ada6p1
2012-06-17.22:55:26 [internal pool scrub done txg:3885584] complete=0
2012-06-17.22:55:26 [internal pool scrub txg:3885584] func=2 mintxg=3 maxtxg=3885254
2012-06-17.23:07:09 [internal pool scrub done txg:3885689] complete=0
2012-06-17.23:07:09 [internal pool scrub txg:3885689] func=2 mintxg=3 maxtxg=3885254
2012-06-17.23:11:29 [internal pool scrub done txg:3885725] complete=0
2012-06-17.23:11:29 [internal pool scrub txg:3885725] func=2 mintxg=3 maxtxg=3885254
2012-06-17.23:14:16 [internal pool scrub done txg:3885748] complete=0
2012-06-17.23:14:16 [internal pool scrub txg:3885748] func=2 mintxg=3 maxtxg=3885254
2012-06-17.23:56:03 [internal pool scrub done txg:3886098] complete=0
[...]
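
Condensed to the commands, the sequence that leads to the problem is roughly
the following (device names as in the history above; the failed drives were
of course physically exchanged before the zpool replace steps):

    zpool create tank raidz1 ada1p1 ada2p1 ada3p1
    # ... first disk fails, exchange the drive ...
    zpool replace tank ada2p1 ada2p1               # this resilver completed
    zpool add tank raidz1 ada0p1 ada1p1 ada3p1     # second raidz1 vdev
    # ... another disk fails, exchange the drive ...
    zpool scrub tank
    zpool replace tank ada6p1 ada6p1               # this resilver never completes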


>Fix:

	No fix or workaround known.

1) Detaching the faulty disk did not help:

[...]
2012-06-19.20:58:11 [internal pool scrub done txg:3910984] complete=0
2012-06-19.20:58:11 [internal pool scrub txg:3910984] func=2 mintxg=3 maxtxg=3885254
2012-06-19.21:11:03 [internal pool scrub done txg:3911106] complete=0
2012-06-19.21:11:03 [internal pool scrub txg:3911106] func=2 mintxg=3 maxtxg=3885254
2012-06-19.21:13:26 [internal pool scrub done txg:3911128] complete=0
2012-06-19.21:13:26 [internal pool scrub txg:3911128] func=2 mintxg=3 maxtxg=3885254
2012-06-19.21:13:48 zpool offline tank 16718347834481227056
2012-06-19.21:16:15 [internal vdev detach txg:3911155] vdev=/dev/ada6p1/old
2012-06-19.21:16:15 zpool detach tank 16718347834481227056
2012-06-19.21:19:17 [internal pool scrub done txg:3911183] complete=0
2012-06-19.21:19:17 [internal pool scrub txg:3911183] func=2 mintxg=3 maxtxg=3885254
2012-06-19.21:20:18 [internal pool scrub done txg:3911192] complete=0
2012-06-19.21:20:18 [internal pool scrub txg:3911192] func=2 mintxg=3 maxtxg=3885254
2012-06-19.21:30:33 [internal pool scrub done txg:3911291] complete=0
2012-06-19.21:30:33 [internal pool scrub txg:3911291] func=2 mintxg=3 maxtxg=3885254
2012-06-19.21:34:32 [internal pool scrub done txg:3911328] complete=0
2012-06-19.21:34:32 [internal pool scrub txg:3911328] func=2 mintxg=3 maxtxg=3885254
[...]
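
Reduced to the commands, this attempt was (GUID of the stale replacing vdev
as shown in zpool status above):

    zpool offline tank 16718347834481227056
    zpool detach tank 16718347834481227056

After the detach, the history still logs new scrub entries with the same
mintxg=3/maxtxg=3885254, i.e. the unfinished resilver is re-queued unchanged.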

2) Removing the disk and running a scrub did not help.

The scrub itself completed without error; see "complete=1" in the excerpt below.

[...]
2012-06-20.06:46:38 [internal pool scrub done txg:3916369] complete=0
2012-06-20.06:46:38 [internal pool scrub txg:3916369] func=2 mintxg=3 maxtxg=3885254
2012-06-20.06:52:29 [internal pool scrub done txg:3916425] complete=0
2012-06-20.06:52:29 [internal pool scrub txg:3916425] func=2 mintxg=3 maxtxg=3885254
2012-06-20.06:57:04 zpool offline tank ada6p1
2012-06-20.07:11:35 zpool online tank 4320346087595370573
2012-06-20.09:03:19 [internal pool scrub done txg:3917680] complete=1
2012-06-20.17:36:49 zpool online tank ada6p1
2012-06-20.18:57:32 [internal pool scrub txg:3924817] func=2 mintxg=3 maxtxg=3924817
2012-06-20.18:57:43 [internal vdev attach txg:3924819] replace vdev=/dev/ada6p1 for vdev=/dev/ada6p1/old
2012-06-20.18:57:43 zpool replace tank ada6p1 ada6p1
2012-06-20.18:58:50 [internal pool scrub done txg:3924829] complete=0
2012-06-20.18:58:50 [internal pool scrub txg:3924829] func=2 mintxg=3 maxtxg=3924817
2012-06-20.19:00:12 [internal pool scrub done txg:3924843] complete=0
2012-06-20.19:00:13 [internal pool scrub txg:3924843] func=2 mintxg=3 maxtxg=3924817
2012-06-20.19:01:38 zpool offline tank ada6p1
2012-06-20.21:13:30 [internal pool scrub done txg:3926107] complete=1
2012-06-20.21:20:44 [internal pool scrub txg:3926198] func=2 mintxg=1616210 maxtxg=3926195
2012-06-20.21:20:57 zpool online tank ada6p1
2012-06-20.21:24:49 [internal pool scrub done txg:3926239] complete=0
2012-06-20.21:24:50 [internal pool scrub txg:3926239] func=2 mintxg=1616210 maxtxg=3926195
2012-06-20.21:27:49 [internal pool scrub done txg:3926266] complete=0
2012-06-20.21:27:49 [internal pool scrub txg:3926266] func=2 mintxg=1616210 maxtxg=3926195
2012-06-20.21:29:59 [internal pool scrub done txg:3926285] complete=0
2012-06-20.21:29:59 [internal pool scrub txg:3926285] func=2 mintxg=1616210 maxtxg=3926195
[...]
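
The zpool commands buried in this excerpt, in order (the two scrubs that
actually finished show up as the complete=1 lines):

    zpool offline tank ada6p1
    zpool online tank 4320346087595370573
    zpool online tank ada6p1
    zpool replace tank ada6p1 ada6p1
    zpool offline tank ada6p1
    zpool online tank ada6p1

As soon as ada6p1 is back online with the replace pending, the restart loop resumes.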

3) Cleaning the drive and replacing it again did not help.

[the zpool history excerpt recorded for this attempt is identical to the one shown under 2) above]

4) Debugging with a DTRACE-enabled kernel:

dtrace -q -n '::zfs_dbgmsg:entry{printf("%s;%d\n", stringof(arg0), arg1)}'

[...]
spa=%s async request task=%u;-2198978129920
doing scan sync txg %llu; bm=%llu/%llu/%llu/%llu;3931639
scanned dataset %llu (%s) with min=%llu max=%llu; pausing=%u;368
visited %llu blocks in %llums;9349
restarting resilver txg=%llu;3931640
internal %s pool:%s txg:%llu %s;-2123208961
spa=%s async request task=%u;-2198978129920
restarting scan func=%u txg=%llu;2
internal %s pool:%s txg:%llu %s;-2123209268
doing scan sync txg %llu; ddt bm=%llu/%llu/%llu/%llx;3931640
scanned %llu ddt entries with class_max = %u; pausing=%u;0
visited %llu blocks in %llums;2386
doing scan sync txg %llu; bm=%llu/%llu/%llu/%llu;3931641
doing scan sync txg %llu; bm=%llu/%llu/%llu/%llu;3931642
visited %llu blocks in %llums;1548
visited %llu blocks in %llums;1223
doing scan sync txg %llu; bm=%llu/%llu/%llu/%llu;3931643
scanned dataset %llu (%s) with min=%llu max=%llu; pausing=%u;18
scanned dataset %llu (%s) with min=%llu max=%llu; pausing=%u;21
scanned dataset %llu (%s) with min=%llu max=%llu; pausing=%u;234
scanned dataset %llu (%s) with min=%llu max=%llu; pausing=%u;44
visited %llu blocks in %llums;558593
doing scan sync txg %llu; bm=%llu/%llu/%llu/%llu;3931644
scanned dataset %llu (%s) with min=%llu max=%llu; pausing=%u;44
visited %llu blocks in %llums;355513
doing scan sync txg %llu; bm=%llu/%llu/%llu/%llu;3931645
scanned dataset %llu (%s) with min=%llu max=%llu; pausing=%u;44
visited %llu blocks in %llums;233292
doing scan sync txg %llu; bm=%llu/%llu/%llu/%llu;3931646
doing scan sync txg %llu; bm=%llu/%llu/%llu/%llu;3931647
scanned dataset %llu (%s) with min=%llu max=%llu; pausing=%u;44
visited %llu blocks in %llums;183353
scanned dataset %llu (%s) with min=%llu max=%llu; pausing=%u;44
visited %llu blocks in %llums;208309
doing scan sync txg %llu; bm=%llu/%llu/%llu/%llu;3931648
spa=%s async request task=%u;-2198978129920
scanned dataset %llu (%s) with min=%llu max=%llu; pausing=%u;44
visited %llu blocks in %llums;47458
spa=%s async request task=%u;-2198978129920
doing scan sync txg %llu; bm=%llu/%llu/%llu/%llu;3931649
scanned dataset %llu (%s) with min=%llu max=%llu; pausing=%u;44
scanned dataset %llu (%s) with min=%llu max=%llu; pausing=%u;368
visited %llu blocks in %llums;21488
restarting resilver txg=%llu;3931650
internal %s pool:%s txg:%llu %s;-2123208961
spa=%s async request task=%u;-2198978129920
restarting scan func=%u txg=%llu;2
[...]
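
Since the probe fires on zfs_dbgmsg:entry, each line above is the raw format
string followed by only the first variadic argument. So
"restarting scan func=%u txg=%llu;2" means func=2 (POOL_SCAN_RESILVER), and the
txg numbers after "doing scan sync" and "restarting resilver" show the scan
being torn down and set up again every few transaction groups. A follow-up
trace along these lines (a sketch; it assumes the fbt probes for these ZFS
functions are available and not inlined on this kernel) should show which code
path keeps requesting the restart:

    dtrace -n 'fbt::dsl_resilver_restart:entry { stack(); }' \
           -n 'fbt::spa_async_request:entry { printf("task=0x%x ", arg1); stack(); }'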

>Release-Note:
>Audit-Trail:
>Unformatted:


