Message-Id: <201305051751.r45HpVkq020980@oldred.FreeBSD.org>
Date: Sun, 5 May 2013 17:51:31 GMT
From: Nathaniel Filardo
To: freebsd-gnats-submit@FreeBSD.org
X-Send-Pr-Version: www-3.1
Subject: kern/178349: zfs scrub on deduped data could be much less seeky

>Number:         178349
>Category:       kern
>Synopsis:       zfs scrub on deduped data could be much less seeky
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    freebsd-bugs
>State:          open
>Quarter:
>Keywords:
>Date-Required:
>Class:          change-request
>Submitter-Id:   current-users
>Arrival-Date:   Sun May 05 18:00:00 UTC 2013
>Closed-Date:
>Last-Modified:
>Originator:     Nathaniel Filardo
>Release:        9.1-STABLE
>Organization:
>Environment:
FreeBSD hydra.priv.oc.ietfng.org 9.1-STABLE FreeBSD 9.1-STABLE #46 r+c68cdd0-dirty: Tue Apr 23 22:59:02 EDT 2013 root@hydra.priv.oc.ietfng.org:/usr/obj/systank/src-git/sys/NWFKERN sparc64
>Description:
ZFS tries to save time in scrubbing by visiting data in most-referenced-to-least-referenced order (so that it need not visit a block once for each reference to it): in short, it scans the DDT for all blocks with refcount > 1 and then walks the on-disk tree to visit refcount == 1 blocks.

Unfortunately, the first phase is apparently prone to being very seeky, resulting in agonizingly slow scrubs and resilvers (my disks all get 18-25 ops/sec during this phase, for a grand total of ~1.5MB/sec from my raidz2; later traversals are much more respectable at 35MB/sec or so).

It would be better, I think, if the scrub logic traversed the DDT with a measure of on-disk locality (though this will, naturally, take several passes to visit all blocks).

A straightforward way to do this, though by no means necessarily the best, would be to allocate in RAM a fixed-size sorted queue of visited block pointers and to ignore block pointers that fall outside the min and max of this queue (rather like the HAMMER2 lazy deduplication logic, amusingly enough). Each block pointer visited would be inserted into the queue. If the queue is full, the insertion displaces a higher address (which will be unnecessarily revisited later, but that's OK) and thereby restricts this pass to a narrower region of the disk, reducing the number of long-distance seeks.

When a pass over the DDT has finished, if the queue's max is still infinity, no additional passes are needed; otherwise, the max of the queue should be made the min, the max should be reset to infinity, and another pass over the DDT should be made.

The current bookmarking scheme is sufficient to resume this game as well, I think, with the understanding that all blocks in the DDT whose on-disk location is greater than the bookmark are still due for scan (i.e., when resuming, use the bookmark as the min of the queue and initialize the max to infinity).

It may make sense, rather than tracking exact block pointers in the queue, to mask off some number of low-order bits from their addresses and track those values instead.
>How-To-Repeat:
>Fix:
>Release-Note:
>Audit-Trail:
>Unformatted:
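
The multi-pass scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not ZFS code: the names windowed_passes, ddt_addresses, capacity and resume_from are hypothetical, each DDT entry is reduced to a single distinct integer on-disk address, and the window is treated as half-open, [min, max), so that a displaced address is picked up again by the next pass.

```python
import bisect
import math


def windowed_passes(ddt_addresses, capacity, resume_from=0):
    """Simulate repeated DDT traversals restricted to an address window.

    ddt_addresses: addresses in DDT traversal order (assumed distinct;
                   duplicates could stall a pass and are not handled here).
    capacity:      fixed size of the in-RAM sorted queue.
    resume_from:   lower bound to start from, e.g. a saved bookmark.

    Returns (passes, visits): the addresses completed by each pass, in
    ascending order, and the total number of block visits including the
    wasted visits to blocks that were later displaced.
    """
    if capacity < 1:
        raise ValueError("capacity must be at least 1")

    lo = resume_from
    passes = []
    visits = 0
    while True:
        hi = math.inf          # the queue's max starts at infinity
        queue = []             # sorted addresses accepted during this pass
        for addr in ddt_addresses:
            if addr < lo or addr >= hi:
                continue       # outside the current window: ignore
            visits += 1        # the block would be read and verified here
            bisect.insort(queue, addr)
            if len(queue) > capacity:
                # Displace the highest address; it becomes the new upper
                # bound and will be revisited by a later pass.
                hi = queue.pop()
        passes.append(queue)
        if hi == math.inf:
            return passes, visits   # nothing was displaced: scan complete
        lo = hi                     # the old max becomes the new min


if __name__ == "__main__":
    ddt_order = [70, 10, 50, 30, 90, 20, 80, 40, 60]
    done, total_visits = windowed_passes(ddt_order, capacity=3)
    for number, addrs in enumerate(done, start=1):
        print("pass", number, "completed", addrs)
    print("total visits:", total_visits)
```

Each pass ends up covering a contiguous slice of the address space, at the cost of some repeated visits; a larger capacity means fewer passes and fewer wasted visits. Passing a saved lower bound as resume_from corresponds to the bookmark-based resume described above. The bit-masking variant is not covered by this sketch.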