From owner-freebsd-current@FreeBSD.ORG Mon Mar 3 09:17:50 2014
Message-ID: <53144891.9050001@FreeBSD.org>
Date: Mon, 03 Mar 2014 11:17:05 +0200
From: Andriy Gapon
To: freebsd-current@FreeBSD.org
Cc: "Justin T. Gibbs", Steven Hartland
Subject: Re: ZFS secondarycache on SSD problem on r255173
In-Reply-To: <4459A6FAB7B8445C97CCB9EFF34FD4F0@multiplay.co.uk>

on 18/10/2013 17:57 Steven Hartland said the following:
> I think we may well need the following patch to set the minblock
> size based on the vdev ashift and not SPA_MINBLOCKSIZE.
>
> svn diff -x -p sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
> Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
> ===================================================================
> --- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c (revision 256554)
> +++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c (working copy)
> @@ -5147,7 +5147,7 @@ l2arc_compress_buf(l2arc_buf_hdr_t *l2hdr)
>  	len = l2hdr->b_asize;
>  	cdata = zio_data_buf_alloc(len);
>  	csize = zio_compress_data(ZIO_COMPRESS_LZ4, l2hdr->b_tmp_cdata,
> -	    cdata, l2hdr->b_asize, (size_t)SPA_MINBLOCKSIZE);
> +	    cdata, l2hdr->b_asize, (size_t)(1ULL << l2hdr->b_dev->l2ad_vdev->vdev_ashift));
>
>  	if (csize == 0) {
>  		/* zero block, indicate that there's nothing to write */

This is a rather old thread and change, but I think that I have identified
another problem with 4KB cache devices.

I noticed that on some of our systems we were getting a clearly abnormal
number of l2arc checksum errors accounted in l2_cksum_bad. The hardware
appeared to be in good health. Using DTrace I noticed that the data seemed
to be overwritten with other data. After more DTrace analysis I observed
that sometimes l2arc_write_buffers() would advance l2ad_hand by more than
target_sz. This meant that l2arc_write_buffers() would write beyond a region
cleared by l2arc_evict() and thus overwrite data belonging to non-evicted
buffers. Havoc ensues.

The cache devices in question are all SSDs with logical sector size of 4KB.
I am not sure about other ZFS platforms, but on FreeBSD this fact is
detected and ashift of 12 is used for the cache vdevs.

Looking at the l2arc_write_buffers() code you can see that it properly
accounts for ashift when actually writing buffers and advancing l2ad_hand:

	/*
	 * Keep the clock hand suitably device-aligned.
	 */
	buf_p_sz = vdev_psize_to_asize(dev->l2ad_vdev, buf_sz);
	write_psize += buf_p_sz;
	dev->l2ad_hand += buf_p_sz;

But the same is not done when selecting buffers to be written and checking
that target_sz is not exceeded. So, if ARC contains a lot of buffers smaller
than 4K, the aligned on-disk size of the L2ARC buffers can be considerably
larger than their non-aligned size, and the hand then moves past the end of
the evicted region.

I propose the following patch, which has been tested and seems to fix the
problem without introducing any side effects:
https://github.com/avg-I/freebsd/compare/review;l2arc-write-target-size.diff
https://github.com/avg-I/freebsd/compare/review;l2arc-write-target-size

-- 
Andriy Gapon
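
To illustrate the size mismatch described above, here is a small standalone
model in C. It is only a sketch and not the proposed patch: roundup_ashift()
and hand_advance() are hypothetical helpers, the rounding is assumed to be a
plain round-up to a multiple of 1 << ashift (which is what I would expect
vdev_psize_to_asize() to amount to on a simple cache vdev), and the buffer
count and sizes are made up.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Round size up to a multiple of (1 << ashift).
 * Assumption: stands in for vdev_psize_to_asize() on a plain cache vdev.
 */
static uint64_t
roundup_ashift(uint64_t size, unsigned ashift)
{
	assert(ashift < 64);
	uint64_t align = 1ULL << ashift;

	return ((size + align - 1) & ~(align - 1));
}

/*
 * Hypothetical model of buffer selection. Up to nbufs buffers of buf_sz
 * bytes each are taken while the accounted total stays within target_sz.
 * With account_aligned == 0 the limit is checked against the raw buffer
 * size; with account_aligned != 0 it is checked against the aligned size.
 * The hand always advances by the aligned size, as in the snippet quoted
 * above. Returns the total distance the hand advances.
 */
static uint64_t
hand_advance(uint64_t buf_sz, unsigned nbufs, uint64_t target_sz,
    unsigned ashift, int account_aligned)
{
	uint64_t counted = 0;	/* bytes charged against target_sz */
	uint64_t advance = 0;	/* bytes the hand actually moves */

	for (unsigned i = 0; i < nbufs; i++) {
		uint64_t asize = roundup_ashift(buf_sz, ashift);
		uint64_t cost = account_aligned ? asize : buf_sz;

		if (counted + cost > target_sz)
			break;
		counted += cost;
		advance += asize;
	}
	return (advance);
}
```

With 512-byte buffers, ashift of 12 and a target of 8 MB, the raw accounting
selects 16384 buffers and moves the hand by 64 MB, eight times the region
that would have been evicted. Accounting by aligned size stops after 2048
buffers and the hand moves by exactly 8 MB.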