From owner-freebsd-bugs@FreeBSD.ORG Fri Oct 26 08:50:01 2012 Return-Path: Delivered-To: freebsd-bugs@hub.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 7EE33807 for ; Fri, 26 Oct 2012 08:50:01 +0000 (UTC) (envelope-from gnats@FreeBSD.org) Received: from freefall.freebsd.org (freefall.FreeBSD.org [8.8.178.135]) by mx1.freebsd.org (Postfix) with ESMTP id 533348FC16 for ; Fri, 26 Oct 2012 08:50:01 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.5/8.14.5) with ESMTP id q9Q8o1X6081974 for ; Fri, 26 Oct 2012 08:50:01 GMT (envelope-from gnats@freefall.freebsd.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.5/8.14.5/Submit) id q9Q8o13U081973; Fri, 26 Oct 2012 08:50:01 GMT (envelope-from gnats) Resent-Date: Fri, 26 Oct 2012 08:50:01 GMT Resent-Message-Id: <201210260850.q9Q8o13U081973@freefall.freebsd.org> Resent-From: FreeBSD-gnats-submit@FreeBSD.org (GNATS Filer) Resent-To: freebsd-bugs@FreeBSD.org Resent-Reply-To: FreeBSD-gnats-submit@FreeBSD.org, Steven Hartland Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id F0C49732 for ; Fri, 26 Oct 2012 08:48:00 +0000 (UTC) (envelope-from nobody@FreeBSD.org) Received: from red.freebsd.org (red.freebsd.org [IPv6:2001:4f8:fff6::22]) by mx1.freebsd.org (Postfix) with ESMTP id D64B98FC12 for ; Fri, 26 Oct 2012 08:48:00 +0000 (UTC) Received: from red.freebsd.org (localhost [127.0.0.1]) by red.freebsd.org (8.14.5/8.14.5) with ESMTP id q9Q8m0nd002688 for ; Fri, 26 Oct 2012 08:48:00 GMT (envelope-from nobody@red.freebsd.org) Received: (from nobody@localhost) by red.freebsd.org (8.14.5/8.14.5/Submit) id q9Q8m09e002687; Fri, 26 Oct 2012 08:48:00 GMT (envelope-from nobody) Message-Id: <201210260848.q9Q8m09e002687@red.freebsd.org> Date: Fri, 26 Oct 2012 08:48:00 GMT From: Steven Hartland To: freebsd-gnats-submit@FreeBSD.org X-Send-Pr-Version: www-3.1 Subject: kern/173115: Teach ZFS about geom stripe size so zpools are created with optimum ashift (patch included) X-BeenThere: freebsd-bugs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Bug reports List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 26 Oct 2012 08:50:01 -0000 >Number: 173115 >Category: kern >Synopsis: Teach ZFS about geom stripe size so zpools are created with optimum ashift (patch included) >Confidential: no >Severity: non-critical >Priority: low >Responsible: freebsd-bugs >State: open >Quarter: >Keywords: >Date-Required: >Class: sw-bug >Submitter-Id: current-users >Arrival-Date: Fri Oct 26 08:50:01 UTC 2012 >Closed-Date: >Last-Modified: >Originator: Steven Hartland >Release: 8.3-RELEASE >Organization: Multiplay >Environment: FreeBSD dev 8.3-RELEASE-p4 FreeBSD 8.3-RELEASE-p4 #22: Mon Sep 17 17:18:32 UTC 2012 root@dev:/usr/obj/usr/src/sys/MULTIPLAY amd64 >Description: Changes zfs zpool initial / desired ashift to be based off stripesize instead of sectorsize making it compatible with drives marked with the 4k sector size quirk. Without the correct min block size BIO_DELETE requests passed to a large number of current SSD's via TRIM don't actually perform any LBA TRIM so its vital for the correct operation of TRIM to get the correct min block size. To do this we added the additional dashift (desired ashift) to vdev_open_func_t calls. This was needed as just updating ashift to be based off stripesize would mean that a devices reported minimum transfer size (ashift) could increase and that in turn would cause member devices to be unusable and hence break pools with error ZFS-8000-5E. >How-To-Repeat: N/A >Fix: Apply the attached patch Patch attached with submission follows: Changes zfs zpool initial / desired ashift to be based off stripesize instead of sectorsize making it compatible with drives marked with the 4k sector size quirk. Without the correct min block size BIO_DELETE requests passed to a large number of current SSD's via TRIM don't actually perform any LBA TRIM so its vital for the correct operation of TRIM to get the correct min block size. To do this we added the additional dashift (desired ashift) to vdev_open_func_t calls. This was needed as just updating ashift to be based off stripesize would mean that a devices reported minimum transfer size (ashift) could increase and that in turn would cause member devices to be unusable and hence break pools with error ZFS-8000-5E. --- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_disk.c.orig 2012-07-03 11:49:34.103219588 +0000 +++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_disk.c 2012-07-03 12:51:44.521525471 +0000 @@ -103,7 +103,7 @@ } static int -vdev_disk_open(vdev_t *vd, uint64_t *psize, uint64_t *ashift) +vdev_disk_open(vdev_t *vd, uint64_t *psize, uint64_t *ashift, uint64_t *dashift) { spa_t *spa = vd->vdev_spa; vdev_disk_t *dvd; @@ -284,7 +284,7 @@ } /* - * Determine the device's minimum transfer size. + * Determine the device's minimum and desired transfer size. * If the ioctl isn't supported, assume DEV_BSIZE. */ if (ldi_ioctl(dvd->vd_lh, DKIOCGMEDIAINFOEXT, (intptr_t)&dkmext, @@ -292,6 +292,7 @@ dkmext.dki_pbsize = DEV_BSIZE; *ashift = highbit(MAX(dkmext.dki_pbsize, SPA_MINBLOCKSIZE)) - 1; + *dashift = *ashift; /* * Clear the nowritecache bit, so that on a vdev_reopen() we will --- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_file.c.orig 2012-07-03 11:48:42.314740333 +0000 +++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_file.c 2012-07-03 11:57:22.579381320 +0000 @@ -47,7 +47,7 @@ } static int -vdev_file_open(vdev_t *vd, uint64_t *psize, uint64_t *ashift) +vdev_file_open(vdev_t *vd, uint64_t *psize, uint64_t *ashift, uint64_t *dashift) { vdev_file_t *vf; vnode_t *vp; @@ -127,6 +127,7 @@ *psize = vattr.va_size; *ashift = SPA_MINBLOCKSHIFT; + *dashift = SPA_MINBLOCKSHIFT; return (0); } --- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c.orig 2012-07-03 12:50:50.158161825 +0000 +++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c 2012-07-03 12:52:01.408085155 +0000 @@ -416,7 +416,7 @@ } static int -vdev_geom_open(vdev_t *vd, uint64_t *psize, uint64_t *ashift) +vdev_geom_open(vdev_t *vd, uint64_t *psize, uint64_t *ashift, uint64_t *dashift) { struct g_provider *pp; struct g_consumer *cp; @@ -502,9 +502,10 @@ *psize = pp->mediasize; /* - * Determine the device's minimum transfer size. + * Determine the device's minimum and desired transfer size. */ *ashift = highbit(MAX(pp->sectorsize, SPA_MINBLOCKSIZE)) - 1; + *dashift = highbit(MAX(pp->stripesize, SPA_MINBLOCKSIZE)) - 1; /* * Clear the nowritecache settings, so that on a vdev_reopen() --- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_mirror.c.orig 2012-07-03 11:49:22.342245151 +0000 +++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_mirror.c 2012-07-03 11:58:02.161948585 +0000 @@ -127,7 +127,7 @@ } static int -vdev_mirror_open(vdev_t *vd, uint64_t *asize, uint64_t *ashift) +vdev_mirror_open(vdev_t *vd, uint64_t *asize, uint64_t *ashift, uint64_t *dashift) { int numerrors = 0; int lasterror = 0; @@ -150,6 +150,7 @@ *asize = MIN(*asize - 1, cvd->vdev_asize - 1) + 1; *ashift = MAX(*ashift, cvd->vdev_ashift); + *dashift = MAX(*dashift, cvd->vdev_dashift); } if (numerrors == vd->vdev_children) { --- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_missing.c.orig 2012-07-03 11:49:10.545275865 +0000 +++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_missing.c 2012-07-03 11:58:07.670470640 +0000 @@ -40,7 +40,7 @@ /* ARGSUSED */ static int -vdev_missing_open(vdev_t *vd, uint64_t *psize, uint64_t *ashift) +vdev_missing_open(vdev_t *vd, uint64_t *psize, uint64_t *ashift, uint64_t *dashift) { /* * Really this should just fail. But then the root vdev will be in the @@ -50,6 +50,7 @@ */ *psize = 0; *ashift = 0; + *dashift = 0; return (0); } --- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c.orig 2012-07-03 11:49:03.675875505 +0000 +++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c 2012-07-03 11:58:15.334806334 +0000 @@ -1447,7 +1447,7 @@ } static int -vdev_raidz_open(vdev_t *vd, uint64_t *asize, uint64_t *ashift) +vdev_raidz_open(vdev_t *vd, uint64_t *asize, uint64_t *ashift, uint64_t *dashift) { vdev_t *cvd; uint64_t nparity = vd->vdev_nparity; @@ -1476,6 +1476,7 @@ *asize = MIN(*asize - 1, cvd->vdev_asize - 1) + 1; *ashift = MAX(*ashift, cvd->vdev_ashift); + *dashift = MAX(*dashift, cvd->vdev_dashift); } *asize *= vd->vdev_children; --- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_root.c.orig 2012-07-03 11:49:27.901760380 +0000 +++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_root.c 2012-07-03 11:58:19.704427068 +0000 @@ -50,7 +50,7 @@ } static int -vdev_root_open(vdev_t *vd, uint64_t *asize, uint64_t *ashift) +vdev_root_open(vdev_t *vd, uint64_t *asize, uint64_t *ashift, uint64_t *dashift) { int lasterror = 0; int numerrors = 0; @@ -78,6 +78,7 @@ *asize = 0; *ashift = 0; + *dashift = 0; return (0); } --- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c.orig 2012-10-22 20:41:50.234005351 +0000 +++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c 2012-10-22 20:42:16.355805894 +0000 @@ -1125,6 +1125,7 @@ uint64_t osize = 0; uint64_t asize, psize; uint64_t ashift = 0; + uint64_t dashift = 0; ASSERT(vd->vdev_open_thread == curthread || spa_config_held(spa, SCL_STATE_ALL, RW_WRITER) == SCL_STATE_ALL); @@ -1154,7 +1155,7 @@ return (ENXIO); } - error = vd->vdev_ops->vdev_op_open(vd, &osize, &ashift); + error = vd->vdev_ops->vdev_op_open(vd, &osize, &ashift, &dashift); /* * Reset the vdev_reopening flag so that we actually close @@ -1255,14 +1256,16 @@ */ vd->vdev_asize = asize; vd->vdev_ashift = MAX(ashift, vd->vdev_ashift); + vd->vdev_dashift = MAX(dashift, vd->vdev_dashift); } else { /* * Make sure the alignment requirement hasn't increased. */ if (ashift > vd->vdev_top->vdev_ashift) { + printf("ZFS ashift open failure of %s (%ld > %ld)\n", vd->vdev_path, ashift, vd->vdev_top->vdev_ashift); vdev_set_state(vd, B_TRUE, VDEV_STATE_CANT_OPEN, VDEV_AUX_BAD_LABEL); return (EINVAL); } } --- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_label.c.orig 2012-10-22 18:51:42.000000000 +0000 +++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_label.c 2012-10-22 20:47:00.098927000 +0000 @@ -277,6 +277,15 @@ vd->vdev_ms_array) == 0); VERIFY(nvlist_add_uint64(nv, ZPOOL_CONFIG_METASLAB_SHIFT, vd->vdev_ms_shift) == 0); + /* + * We use the max of ashift and dashift (the desired ashift), + * which is typically the stripesize of a device, to ensure we + * get the best performance from underlying devices. + * + * Its done here as it should only ever have an effect on new + * zpool creation. + */ + vd->vdev_ashift = MAX(vd->vdev_ashift, vd->vdev_dashift); VERIFY(nvlist_add_uint64(nv, ZPOOL_CONFIG_ASHIFT, vd->vdev_ashift) == 0); VERIFY(nvlist_add_uint64(nv, ZPOOL_CONFIG_ASIZE, --- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h.orig 2012-10-22 20:40:08.361577293 +0000 +++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h 2012-10-22 21:02:52.447781800 +0000 @@ -55,7 +55,7 @@ /* * Virtual device operations */ -typedef int vdev_open_func_t(vdev_t *vd, uint64_t *size, uint64_t *ashift); +typedef int vdev_open_func_t(vdev_t *vd, uint64_t *size, uint64_t *ashift, uint64_t *dashift); typedef void vdev_close_func_t(vdev_t *vd); typedef uint64_t vdev_asize_func_t(vdev_t *vd, uint64_t psize); typedef int vdev_io_start_func_t(zio_t *zio); @@ -119,6 +119,7 @@ uint64_t vdev_asize; /* allocatable device capacity */ uint64_t vdev_min_asize; /* min acceptable asize */ uint64_t vdev_ashift; /* block alignment shift */ + uint64_t vdev_dashift; /* desired blk alignment shift */ uint64_t vdev_state; /* see VDEV_STATE_* #defines */ uint64_t vdev_prevstate; /* used when reopening a vdev */ vdev_ops_t *vdev_ops; /* vdev operations */ >Release-Note: >Audit-Trail: >Unformatted: