From owner-freebsd-hackers@FreeBSD.ORG Wed Jul 10 09:25:27 2013 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id D675FEB5; Wed, 10 Jul 2013 09:25:27 +0000 (UTC) (envelope-from prvs=1903808b5b=killing@multiplay.co.uk) Received: from mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23]) by mx1.freebsd.org (Postfix) with ESMTP id BDCB21E65; Wed, 10 Jul 2013 09:25:26 +0000 (UTC) Received: from r2d2 ([82.69.141.170]) by mail1.multiplay.co.uk (mail1.multiplay.co.uk [85.236.96.23]) (MDaemon PRO v10.0.4) with ESMTP id md50004902843.msg; Wed, 10 Jul 2013 10:25:24 +0100 X-Spam-Processed: mail1.multiplay.co.uk, Wed, 10 Jul 2013 10:25:24 +0100 (not processed: message from valid local sender) X-MDDKIM-Result: neutral (mail1.multiplay.co.uk) X-MDRemoteIP: 82.69.141.170 X-Return-Path: prvs=1903808b5b=killing@multiplay.co.uk X-Envelope-From: killing@multiplay.co.uk Message-ID: <628C5D1AF6044488B708484203D70B7A@multiplay.co.uk> From: "Steven Hartland" To: =?utf-8?Q?Dag-Erling_Sm=C3=B8rgrav?= , , , References: <86zjtupz3r.fsf@nine.des.no> Subject: Re: Make ZFS use the physical sector size when computing initial ashift Date: Wed, 10 Jul 2013 10:25:36 +0100 MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="----=_NextPart_000_1133_01CE7D57.D127DB40" X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 6.00.2900.5931 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157 Cc: ivoras@freebsd.org X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 10 Jul 2013 09:25:28 -0000 This is a multi-part message in MIME format. ------=_NextPart_000_1133_01CE7D57.D127DB40 Content-Type: text/plain; format=flowed; charset="utf-8"; reply-type=original Content-Transfer-Encoding: 8bit Hi DES, unfortunately you need a quite bit more than this to work compatibly. I've had a patch here that does just this for quite some time but there's been some discussion on how we want additional control over this so its not been commited. If others are interested I've attached this as it achieves what we needed here so may also be of use for others too. There's also a big discussion on illumos about this very subject ATM so I'm monitoring that too. Hopefully there will be a nice conclusion come from that how people want to proceed and we'll be able to get a change in that works for everyone. Regards Steve ----- Original Message ----- From: "Dag-Erling Smørgrav" To: ; Cc: Sent: Wednesday, July 10, 2013 10:02 AM Subject: Make ZFS use the physical sector size when computing initial ashift The attached patch causes ZFS to base the minimum transfer size for a new vdev on the GEOM provider's stripesize (physical sector size) rather than sectorsize (logical sector size), provided that stripesize is a power of two larger than sectorsize and smaller than or equal to VDEV_PAD_SIZE. This should eliminate the need for ivoras@'s gnop trick when creating ZFS pools on Advanced Format drives. DES -- Dag-Erling Smørgrav - des@des.no -------------------------------------------------------------------------------- > _______________________________________________ > freebsd-fs@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-fs > To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" ================================================ This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it. In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337 or return the E.mail to postmaster@multiplay.co.uk. ------=_NextPart_000_1133_01CE7D57.D127DB40 Content-Type: application/octet-stream; name="zzz-zfs-ashift-fix.patch" Content-Transfer-Encoding: quoted-printable Content-Disposition: attachment; filename="zzz-zfs-ashift-fix.patch" Changes zfs zpool initial / desired ashift to be based off stripesize=0A= instead of sectorsize making it compatible with drives marked with=0A= the 4k sector size quirk.=0A= =0A= Without the correct min block size BIO_DELETE requests passed to=0A= a large number of current SSD's via TRIM don't actually perform=0A= any LBA TRIM so its vital for the correct operation of TRIM to get=0A= the correct min block size.=0A= =0A= To do this we added the additional dashift (desired ashift) to=0A= vdev_open_func_t calls. This was needed as just updating ashift to=0A= be based off stripesize would mean that a devices reported minimum=0A= transfer size (ashift) could increase and that in turn would cause=0A= member devices to be unusable and hence break pools with error=0A= ZFS-8000-5E.=0A= =0A= The global minimum ashift used for new zpools can now also be=0A= tuned using the vfs.zfs.min_create_ashift sysctl. This defaults=0A= to 12 (4096 byte blocks) in order to optimise for newer disks which=0A= are migrating from 512 to 4096 byte sectors.=0A= =0A= The value of vfs.zfs.min_create_ashift is limited to min of=0A= SPA_MINBLOCKSHIFT (9) and a max of SPA_MAXBLOCKSHIFT (17).=0A= --- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_disk.c.orig = 2011-06-06 09:36:46.000000000 +0000=0A= +++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_disk.c = 2012-11-02 14:47:55.293668071 +0000=0A= @@ -32,6 +32,8 @@=0A= #include =0A= #include =0A= =0A= +extern int zfs_min_ashift;=0A= +=0A= /*=0A= * Virtual device vector for disks.=0A= */=0A= @@ -103,7 +105,7 @@=0A= }=0A= =0A= static int=0A= -vdev_disk_open(vdev_t *vd, uint64_t *psize, uint64_t *ashift)=0A= +vdev_disk_open(vdev_t *vd, uint64_t *psize, uint64_t *ashift, uint64_t = *dashift)=0A= {=0A= spa_t *spa =3D vd->vdev_spa;=0A= vdev_disk_t *dvd;=0A= @@ -284,7 +286,7 @@=0A= }=0A= =0A= /*=0A= - * Determine the device's minimum transfer size.=0A= + * Determine the device's minimum and desired transfer size.=0A= * If the ioctl isn't supported, assume DEV_BSIZE.=0A= */=0A= if (ldi_ioctl(dvd->vd_lh, DKIOCGMEDIAINFOEXT, (intptr_t)&dkmext,=0A= @@ -292,6 +294,7 @@=0A= dkmext.dki_pbsize =3D DEV_BSIZE;=0A= =0A= *ashift =3D highbit(MAX(dkmext.dki_pbsize, SPA_MINBLOCKSIZE)) - 1;=0A= + *dashift =3D highbit(MAX(dkmext.dki_pbsize, (1ULL << zfs_min_ashift))) = - 1;=0A= =0A= /*=0A= * Clear the nowritecache bit, so that on a vdev_reopen() we will=0A= --- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_file.c.orig = 2012-01-05 22:31:25.000000000 +0000=0A= +++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_file.c = 2012-11-02 14:47:38.252107541 +0000=0A= @@ -30,6 +30,8 @@=0A= #include =0A= #include =0A= =0A= +extern int zfs_min_ashift;=0A= +=0A= /*=0A= * Virtual device vector for files.=0A= */=0A= @@ -47,7 +49,7 @@=0A= }=0A= =0A= static int=0A= -vdev_file_open(vdev_t *vd, uint64_t *psize, uint64_t *ashift)=0A= +vdev_file_open(vdev_t *vd, uint64_t *psize, uint64_t *ashift, uint64_t = *dashift)=0A= {=0A= vdev_file_t *vf;=0A= vnode_t *vp;=0A= @@ -127,6 +129,7 @@=0A= =0A= *psize =3D vattr.va_size;=0A= *ashift =3D SPA_MINBLOCKSHIFT;=0A= + *dashift =3D zfs_min_ashift;=0A= =0A= return (0);=0A= }=0A= --- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c.orig = 2012-11-02 12:20:15.918986181 +0000=0A= +++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c = 2012-11-02 14:47:48.135273692 +0000=0A= @@ -36,6 +36,8 @@=0A= #include =0A= #include =0A= =0A= +extern int zfs_min_ashift;=0A= +=0A= /*=0A= * Virtual device vector for GEOM.=0A= */=0A= @@ -408,7 +410,7 @@=0A= }=0A= =0A= static int=0A= -vdev_geom_open(vdev_t *vd, uint64_t *psize, uint64_t *ashift)=0A= +vdev_geom_open(vdev_t *vd, uint64_t *psize, uint64_t *ashift, uint64_t = *dashift)=0A= {=0A= struct g_provider *pp;=0A= struct g_consumer *cp;=0A= @@ -494,9 +496,10 @@=0A= *psize =3D pp->mediasize;=0A= =0A= /*=0A= - * Determine the device's minimum transfer size.=0A= + * Determine the device's minimum and desired transfer size.=0A= */=0A= *ashift =3D highbit(MAX(pp->sectorsize, SPA_MINBLOCKSIZE)) - 1;=0A= + *dashift =3D highbit(MAX(pp->stripesize, (1ULL << zfs_min_ashift))) - = 1;=0A= =0A= /*=0A= * Clear the nowritecache settings, so that on a vdev_reopen()=0A= --- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_mirror.c.orig = 2012-07-03 11:49:22.342245151 +0000=0A= +++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_mirror.c = 2012-07-03 11:58:02.161948585 +0000=0A= @@ -127,7 +127,7 @@=0A= }=0A= =0A= static int=0A= -vdev_mirror_open(vdev_t *vd, uint64_t *asize, uint64_t *ashift)=0A= +vdev_mirror_open(vdev_t *vd, uint64_t *asize, uint64_t *ashift, = uint64_t *dashift)=0A= {=0A= int numerrors =3D 0;=0A= int lasterror =3D 0;=0A= @@ -150,6 +150,7 @@=0A= =0A= *asize =3D MIN(*asize - 1, cvd->vdev_asize - 1) + 1;=0A= *ashift =3D MAX(*ashift, cvd->vdev_ashift);=0A= + *dashift =3D MAX(*dashift, cvd->vdev_dashift);=0A= }=0A= =0A= if (numerrors =3D=3D vd->vdev_children) {=0A= --- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_missing.c.orig = 2012-07-03 11:49:10.545275865 +0000=0A= +++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_missing.c = 2012-07-03 11:58:07.670470640 +0000=0A= @@ -40,7 +40,7 @@=0A= =0A= /* ARGSUSED */=0A= static int=0A= -vdev_missing_open(vdev_t *vd, uint64_t *psize, uint64_t *ashift)=0A= +vdev_missing_open(vdev_t *vd, uint64_t *psize, uint64_t *ashift, = uint64_t *dashift)=0A= {=0A= /*=0A= * Really this should just fail. But then the root vdev will be in the=0A= @@ -50,6 +50,7 @@=0A= */=0A= *psize =3D 0;=0A= *ashift =3D 0;=0A= + *dashift =3D 0;=0A= return (0);=0A= }=0A= =0A= --- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c.orig = 2012-07-03 11:49:03.675875505 +0000=0A= +++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c = 2012-07-03 11:58:15.334806334 +0000=0A= @@ -1447,7 +1447,7 @@=0A= }=0A= =0A= static int=0A= -vdev_raidz_open(vdev_t *vd, uint64_t *asize, uint64_t *ashift)=0A= +vdev_raidz_open(vdev_t *vd, uint64_t *asize, uint64_t *ashift, uint64_t = *dashift)=0A= {=0A= vdev_t *cvd;=0A= uint64_t nparity =3D vd->vdev_nparity;=0A= @@ -1476,6 +1476,7 @@=0A= =0A= *asize =3D MIN(*asize - 1, cvd->vdev_asize - 1) + 1;=0A= *ashift =3D MAX(*ashift, cvd->vdev_ashift);=0A= + *dashift =3D MAX(*dashift, cvd->vdev_dashift);=0A= }=0A= =0A= *asize *=3D vd->vdev_children;=0A= --- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_root.c.orig = 2012-07-03 11:49:27.901760380 +0000=0A= +++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_root.c = 2012-07-03 11:58:19.704427068 +0000=0A= @@ -50,7 +50,7 @@=0A= }=0A= =0A= static int=0A= -vdev_root_open(vdev_t *vd, uint64_t *asize, uint64_t *ashift)=0A= +vdev_root_open(vdev_t *vd, uint64_t *asize, uint64_t *ashift, uint64_t = *dashift)=0A= {=0A= int lasterror =3D 0;=0A= int numerrors =3D 0;=0A= @@ -78,6 +78,7 @@=0A= =0A= *asize =3D 0;=0A= *ashift =3D 0;=0A= + *dashift =3D 0;=0A= =0A= return (0);=0A= }=0A= --- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c.orig = 2012-10-22 20:41:50.234005351 +0000=0A= +++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c 2012-10-22 = 20:42:16.355805894 +0000=0A= @@ -1125,6 +1125,7 @@=0A= uint64_t osize =3D 0;=0A= uint64_t asize, psize;=0A= uint64_t ashift =3D 0;=0A= + uint64_t dashift =3D 0;=0A= =0A= ASSERT(vd->vdev_open_thread =3D=3D curthread ||=0A= spa_config_held(spa, SCL_STATE_ALL, RW_WRITER) =3D=3D = SCL_STATE_ALL);=0A= @@ -1154,7 +1155,7 @@=0A= return (ENXIO);=0A= }=0A= =0A= - error =3D vd->vdev_ops->vdev_op_open(vd, &osize, &ashift);=0A= + error =3D vd->vdev_ops->vdev_op_open(vd, &osize, &ashift, &dashift);=0A= =0A= /*=0A= * Reset the vdev_reopening flag so that we actually close=0A= @@ -1255,14 +1256,16 @@=0A= */=0A= vd->vdev_asize =3D asize;=0A= vd->vdev_ashift =3D MAX(ashift, vd->vdev_ashift);=0A= + vd->vdev_dashift =3D MAX(dashift, vd->vdev_dashift);=0A= } else {=0A= /*=0A= * Make sure the alignment requirement hasn't increased.=0A= */=0A= if (ashift > vd->vdev_top->vdev_ashift) {=0A= + printf("ZFS ashift open failure of %s (%ld > %ld)\n", vd->vdev_path, = ashift, vd->vdev_top->vdev_ashift);=0A= vdev_set_state(vd, B_TRUE, VDEV_STATE_CANT_OPEN,=0A= VDEV_AUX_BAD_LABEL);=0A= return (EINVAL);=0A= }=0A= }=0A= =0A= --- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_label.c.orig = 2012-11-05 15:27:52.092194343 +0000=0A= +++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_label.c = 2012-11-05 15:53:26.449021023 +0000=0A= @@ -145,9 +145,12 @@=0A= #include =0A= =0A= static boolean_t vdev_trim_on_init =3D B_TRUE;=0A= +static boolean_t vdev_dashift_enable =3D B_TRUE;=0A= SYSCTL_DECL(_vfs_zfs_vdev);=0A= SYSCTL_INT(_vfs_zfs_vdev, OID_AUTO, trim_on_init, CTLFLAG_RW,=0A= &vdev_trim_on_init, 0, "Enable/disable full vdev trim on = initialisation");=0A= +SYSCTL_INT(_vfs_zfs_vdev, OID_AUTO, optimal_ashift, CTLFLAG_RW,=0A= + &vdev_dashift_enable, 0, "Enable/disable optimal ashift usage on = initialisation");=0A= =0A= /*=0A= * Basic routines to read and write from a vdev label.=0A= @@ -282,6 +285,16 @@=0A= vd->vdev_ms_array) =3D=3D 0);=0A= VERIFY(nvlist_add_uint64(nv, ZPOOL_CONFIG_METASLAB_SHIFT,=0A= vd->vdev_ms_shift) =3D=3D 0);=0A= + /*=0A= + * We use the max of ashift and dashift (the desired/optimal=0A= + * ashift), which is typically the stripesize of a device, to=0A= + * ensure we get the best performance from underlying devices.=0A= + * =0A= + * Its done here as it should only ever have an effect on new=0A= + * zpool creation.=0A= + */=0A= + if (vdev_dashift_enable)=0A= + vd->vdev_ashift =3D MAX(vd->vdev_ashift, vd->vdev_dashift);=0A= VERIFY(nvlist_add_uint64(nv, ZPOOL_CONFIG_ASHIFT,=0A= vd->vdev_ashift) =3D=3D 0);=0A= VERIFY(nvlist_add_uint64(nv, ZPOOL_CONFIG_ASIZE,=0A= --- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h.orig = 2012-10-22 20:40:08.361577293 +0000=0A= +++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h = 2012-10-22 21:02:52.447781800 +0000=0A= @@ -55,7 +55,7 @@=0A= /*=0A= * Virtual device operations=0A= */=0A= -typedef int vdev_open_func_t(vdev_t *vd, uint64_t *size, uint64_t = *ashift);=0A= +typedef int vdev_open_func_t(vdev_t *vd, uint64_t *size, uint64_t = *ashift, uint64_t *dashift);=0A= typedef void vdev_close_func_t(vdev_t *vd);=0A= typedef uint64_t vdev_asize_func_t(vdev_t *vd, uint64_t psize);=0A= typedef int vdev_io_start_func_t(zio_t *zio);=0A= @@ -119,6 +119,7 @@=0A= uint64_t vdev_asize; /* allocatable device capacity */=0A= uint64_t vdev_min_asize; /* min acceptable asize */=0A= uint64_t vdev_ashift; /* block alignment shift */=0A= + uint64_t vdev_dashift; /* desired blk alignment shift */=0A= uint64_t vdev_state; /* see VDEV_STATE_* #defines */=0A= uint64_t vdev_prevstate; /* used when reopening a vdev */=0A= vdev_ops_t *vdev_ops; /* vdev operations */=0A= --- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_pool.c.orig = 2012-11-02 14:56:29.474248887 +0000=0A= +++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_pool.c 2012-11-03 = 01:27:28.066912403 +0000=0A= @@ -41,6 +41,30 @@=0A= #include =0A= #include =0A= =0A= +#define ZFS_MIN_ASHIFT SPA_MINBLOCKSHIFT=0A= +/*=0A= + * Max ashift - limited by how labels are accessed by zio_read_phys = using offsets=0A= + * within vdev_label_t=0A= + *=0A= + * If label access is fixed to work with ashift properly then the max = should be=0A= + * set to SPA_MAXBLOCKSHIFT=0A= + */=0A= +#define ZFS_MAX_ASHIFT 13=0A= +/*=0A= + * Optimum ashift - defaults to 12 which results in a min block size of = 4096 as=0A= + * this is the optimum value for newer disks which are migrating from = 512 to 4096=0A= + * byte sectors=0A= + */=0A= +#define ZFS_OPTIMUM_ASHIFT 12 =0A= +=0A= +/*=0A= + * Minimum ashift used when creating new pools=0A= + *=0A= + * This can be tuned using the sysctl vfs.zfs.min_create_ashift but is = limited=0A= + * to a min of ZFS_MIN_ASHIFT and a max of ZFS_MAX_ASHIFT=0A= + * =0A= + */=0A= +int zfs_min_ashift =3D MAX(SPA_MINBLOCKSHIFT, ZFS_OPTIMUM_ASHIFT);=0A= int zfs_no_write_throttle =3D 0;=0A= int zfs_write_limit_shift =3D 3; /* 1/8th of physical memory */=0A= int zfs_txg_synctime_ms =3D 1000; /* target millisecs to sync a txg */=0A= @@ -54,6 +78,9 @@=0A= =0A= static pgcnt_t old_physmem =3D 0;=0A= =0A= +#ifdef _KERNEL=0A= +static int min_ashift_sysctl(SYSCTL_HANDLER_ARGS);=0A= +=0A= SYSCTL_DECL(_vfs_zfs);=0A= TUNABLE_INT("vfs.zfs.no_write_throttle", &zfs_no_write_throttle);=0A= SYSCTL_INT(_vfs_zfs, OID_AUTO, no_write_throttle, CTLFLAG_RDTUN,=0A= @@ -78,6 +105,32 @@=0A= TUNABLE_QUAD("vfs.zfs.write_limit_override", &zfs_write_limit_override);=0A= SYSCTL_QUAD(_vfs_zfs, OID_AUTO, write_limit_override, CTLFLAG_RDTUN,=0A= &zfs_write_limit_override, 0, "");=0A= +SYSCTL_PROC(_vfs_zfs, OID_AUTO, min_create_ashift, CTLTYPE_INT | = CTLFLAG_RW,=0A= + &zfs_min_ashift, 0, min_ashift_sysctl, "I",=0A= + "Minimum ashift used when creating new pools");=0A= +=0A= +static int=0A= +min_ashift_sysctl(SYSCTL_HANDLER_ARGS)=0A= +{=0A= + int error, value;=0A= +=0A= + value =3D *(int *)arg1;=0A= +=0A= + error =3D sysctl_handle_int(oidp, &value, 0, req);=0A= +=0A= + if ((error !=3D 0) || (req->newptr =3D=3D NULL))=0A= + return (error);=0A= +=0A= + if (value < ZFS_MIN_ASHIFT)=0A= + value =3D ZFS_MIN_ASHIFT;=0A= + else if (value > ZFS_MAX_ASHIFT)=0A= + value =3D ZFS_MAX_ASHIFT;=0A= +=0A= + *(int *)arg1 =3D value;=0A= +=0A= + return (0);=0A= +}=0A= +#endif=0A= =0A= int=0A= dsl_pool_open_special_dir(dsl_pool_t *dp, const char *name, dsl_dir_t = **ddp)=0A= ------=_NextPart_000_1133_01CE7D57.D127DB40--