From owner-svn-src-stable@freebsd.org Wed Jul 26 17:44:23 2017 Return-Path: Delivered-To: svn-src-stable@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 6CB47DABB57; Wed, 26 Jul 2017 17:44:23 +0000 (UTC) (envelope-from mav@FreeBSD.org) Received: from repo.freebsd.org (repo.freebsd.org [IPv6:2610:1c1:1:6068::e6a:0]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 499496ED65; Wed, 26 Jul 2017 17:44:23 +0000 (UTC) (envelope-from mav@FreeBSD.org) Received: from repo.freebsd.org ([127.0.1.37]) by repo.freebsd.org (8.15.2/8.15.2) with ESMTP id v6QHiMZi003191; Wed, 26 Jul 2017 17:44:22 GMT (envelope-from mav@FreeBSD.org) Received: (from mav@localhost) by repo.freebsd.org (8.15.2/8.15.2/Submit) id v6QHiMVE003190; Wed, 26 Jul 2017 17:44:22 GMT (envelope-from mav@FreeBSD.org) Message-Id: <201707261744.v6QHiMVE003190@repo.freebsd.org> X-Authentication-Warning: repo.freebsd.org: mav set sender to mav@FreeBSD.org using -f From: Alexander Motin Date: Wed, 26 Jul 2017 17:44:22 +0000 (UTC) To: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-stable@freebsd.org, svn-src-stable-11@freebsd.org Subject: svn commit: r321574 - stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs X-SVN-Group: stable-11 X-SVN-Commit-Author: mav X-SVN-Commit-Paths: stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs X-SVN-Commit-Revision: 321574 X-SVN-Commit-Repository: base MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-BeenThere: svn-src-stable@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: SVN commit messages for all the -stable branches of the src tree List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 26 Jul 2017 17:44:23 -0000 Author: mav Date: Wed Jul 26 17:44:22 2017 New Revision: 321574 URL: https://svnweb.freebsd.org/changeset/base/321574 Log: MFC r319749: MFV r319739: 8005 poor performance of 1MB writes on certain RAID-Z configuration s illumos/illumos-gate@5b062782532a1d5961c4a4b655906e1238c7c908 https://github.com/illumos/illumos-gate/commit/5b062782532a1d5961c4a4b655906e123 8c7c908 https://www.illumos.org/issues/8005 RAID-Z requires that space be allocated in multiples of P+1 sectors, because this is the minimum size block that can have the required amount of parity. Thus blocks on RAIDZ1 must be allocated in a multiple of 2 sectors; on RAIDZ2 multiple of 3; and on RAIDZ3 multiple of 4. A sector is a unit of 2^ashift bytes, typically 512B or 4KB. To satisfy this constraint, the allocation size is rounded up to the proper multiple, resulting in up to 3 "pad sectors" at the end of some blocks. The contents of these pad sectors are not used, so we do not need to read or write these sectors. However, some storage hardware performs much worse (around 1/2 as fast) on mostly-contiguous writes when there are small gaps of non-overwritten data between the writes. Therefore, ZFS creates "optional" zio's when writing RAID-Z blocks that include pad sectors. If writing a pad sector will fill the gap between two (required) writes, we will issue the optional zio, thus doubling performance. The gap-filling performance improvement was introduced in July 2009. Writing the optional zio is done by the io aggregation code in vdev_queue.c. The problem is that it is also subject to the limit on the size of aggregate writes, zfs_vdev_aggregation_limit, which is by default 128KB. For a given block, if the amount of data plus padding written to a leaf device exceeds zfs_vdev_aggregation_limit, the optional zio will not be written, resulting in a ~2x performance degradation. The problem occurs only for certain values of ashift, compressed block size, and RAID-Z configuration (number of parity and data disks). It cannot occur with the default recordsize=128KB. If compression is enabled, all configurations with recordsize=1MB or larger will be impacted to some degree. The problem notably occurs with recordsize=1MB, compression=off, with 10 disks in a RAIDZ2 or RAIDZ3 group (with 512B or 4KB sectors). Therefore Reviewed by: Saso Kiselkov Reviewed by: George Wilson Reviewed by: Pavel Zakharov Approved by: Robert Mustacchi Author: Matthew Ahrens Modified: stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_queue.c Directory Properties: stable/11/ (props changed) Modified: stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_queue.c ============================================================================== --- stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_queue.c Wed Jul 26 17:42:43 2017 (r321573) +++ stable/11/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_queue.c Wed Jul 26 17:44:22 2017 (r321574) @@ -24,7 +24,7 @@ */ /* - * Copyright (c) 2012, 2014 by Delphix. All rights reserved. + * Copyright (c) 2012, 2017 by Delphix. All rights reserved. * Copyright (c) 2014 Integros [integros.com] */ @@ -681,7 +681,7 @@ vdev_queue_aggregate(vdev_queue_t *vq, zio_t *zio) /* * Walk backwards through sufficiently contiguous I/Os - * recording the last non-option I/O. + * recording the last non-optional I/O. */ flags = zio->io_flags & ZIO_FLAG_AGG_INHERIT; t = vdev_queue_type_tree(vq, zio->io_type); @@ -704,10 +704,14 @@ vdev_queue_aggregate(vdev_queue_t *vq, zio_t *zio) /* * Walk forward through sufficiently contiguous I/Os. + * The aggregation limit does not apply to optional i/os, so that + * we can issue contiguous writes even if they are larger than the + * aggregation limit. */ while ((dio = AVL_NEXT(t, last)) != NULL && (dio->io_flags & ZIO_FLAG_AGG_INHERIT) == flags && - IO_SPAN(first, dio) <= zfs_vdev_aggregation_limit && + (IO_SPAN(first, dio) <= zfs_vdev_aggregation_limit || + (dio->io_flags & ZIO_FLAG_OPTIONAL)) && IO_GAP(last, dio) <= maxgap) { last = dio; if (!(last->io_flags & ZIO_FLAG_OPTIONAL)) @@ -739,10 +743,16 @@ vdev_queue_aggregate(vdev_queue_t *vq, zio_t *zio) } if (stretch) { - /* This may be a no-op. */ + /* + * We are going to include an optional io in our aggregated + * span, thus closing the write gap. Only mandatory i/os can + * start aggregated spans, so make sure that the next i/o + * after our span is mandatory. + */ dio = AVL_NEXT(t, last); dio->io_flags &= ~ZIO_FLAG_OPTIONAL; } else { + /* do not include the optional i/o */ while (last != mandatory && last != first) { ASSERT(last->io_flags & ZIO_FLAG_OPTIONAL); last = AVL_PREV(t, last); @@ -754,7 +764,7 @@ vdev_queue_aggregate(vdev_queue_t *vq, zio_t *zio) return (NULL); size = IO_SPAN(first, last); - ASSERT3U(size, <=, zfs_vdev_aggregation_limit); + ASSERT3U(size, <=, SPA_MAXBLOCKSIZE); abuf = zio_buf_alloc_nowait(size); if (abuf == NULL)