From owner-freebsd-virtualization@freebsd.org Thu Mar 21 01:47:44 2019 Return-Path: Delivered-To: freebsd-virtualization@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 0FEF81546522 for ; Thu, 21 Mar 2019 01:47:44 +0000 (UTC) (envelope-from mike.gerdts@joyent.com) Received: from mail-pf1-x42e.google.com (mail-pf1-x42e.google.com [IPv6:2607:f8b0:4864:20::42e]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "GTS CA 1O1" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 0F8328CD49 for ; Thu, 21 Mar 2019 01:47:41 +0000 (UTC) (envelope-from mike.gerdts@joyent.com) Received: by mail-pf1-x42e.google.com with SMTP id 8so3235327pfr.4 for ; Wed, 20 Mar 2019 18:47:41 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=joyent.com; s=google; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=K2z6GqCzoNkhrL9ZT2QTRKiHADBD+R2KKvTNGe9bPCs=; b=M84TeUV1V8hpC/Nv7BmSFxkrmLW16pDFVBHoFyNoQCUfRQ7ew2oxWHC8IBJmGW/MFp ZX3+62kFWKSyqqwzRLctVyTqHf4wIS+AVTFDs1fiaVYeXfXDVb7vWH0/2NL9CJxCkBz5 DaIUVRFAKBUmqNtJ1kB9YW7jZkwMMGwllVXmw= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=K2z6GqCzoNkhrL9ZT2QTRKiHADBD+R2KKvTNGe9bPCs=; b=CZf2iuuIWPQaYiWARrpI2oQVbphoiqJs37tFeMRvMihqO2gEFdH661F0q/h15HrFJU QNZKj8x8FwOhL8b0hLIv3uAeeSWgm3iip96ZL0/4j7vCUqgFzRbgAtVqgzS5jNfPJV4X zeDuXfjB71iTeh2umo4+JFT0rttdjC+5kB3L6ZLAXPSf8nBF5hDcTeI4QNvJLNK8Bn5p dYKHOVSL0wA9Y4wLKn9RijL4Ma2qSdT2F43p8zHwbfHgpb84B/OQ7dbg/ArdbZ21n4tf Bbl/Nr+uTTTRpXpykRcwlT3mKQNpE1QTI/ba8z+aiwFd1MgePe41HCDYICzuqx9uizu3 IZEg== X-Gm-Message-State: APjAAAUnYyvrdI+i+/Ib9w9/eQOgxsvD/2qevHvEXSaiolnT0OuyEEk3 NyQVdXhxTU311EViSePLjcGeFfrEf0Af3tFsRlWRYHIQKEE= X-Google-Smtp-Source: APXvYqyj1XoFf0aXdNGgy/dYHzAe4gyE+MRxMt2OiDq00JhgxzoDiaWY78hB6kPJ5VGHEKY4gUatjR90IwIYJLVG+bo= X-Received: by 2002:a65:50c4:: with SMTP id s4mr1008041pgp.33.1553132860604; Wed, 20 Mar 2019 18:47:40 -0700 (PDT) MIME-Version: 1.0 References: <20190319024638.GA8193@admin.sibptus.ru> In-Reply-To: From: Mike Gerdts Date: Wed, 20 Mar 2019 20:47:29 -0500 Message-ID: Subject: Re: bhyve and vfs.zfs.arc_max, and zfs tuning for a hypervisor To: "Patrick M. Hausen" Cc: Victor Sudakov , freebsd-virtualization@freebsd.org X-Rspamd-Queue-Id: 0F8328CD49 X-Spamd-Bar: ------ Authentication-Results: mx1.freebsd.org; dkim=pass header.d=joyent.com header.s=google header.b=M84TeUV1; spf=pass (mx1.freebsd.org: domain of mike.gerdts@joyent.com designates 2607:f8b0:4864:20::42e as permitted sender) smtp.mailfrom=mike.gerdts@joyent.com X-Spamd-Result: default: False [-6.30 / 15.00]; ARC_NA(0.00)[]; NEURAL_HAM_MEDIUM(-1.00)[-1.000,0]; R_DKIM_ALLOW(-0.20)[joyent.com:s=google]; FROM_HAS_DN(0.00)[]; RCPT_COUNT_THREE(0.00)[3]; R_SPF_ALLOW(-0.20)[+ip6:2607:f8b0:4000::/36]; NEURAL_HAM_LONG(-1.00)[-1.000,0]; MIME_GOOD(-0.10)[multipart/alternative,text/plain]; PREVIOUSLY_DELIVERED(0.00)[freebsd-virtualization@freebsd.org]; DMARC_NA(0.00)[joyent.com]; TO_DN_SOME(0.00)[]; TO_MATCH_ENVRCPT_SOME(0.00)[]; DKIM_TRACE(0.00)[joyent.com:+]; MX_GOOD(-0.01)[alt1.aspmx.l.google.com,aspmx.l.google.com,aspmx2.googlemail.com,alt2.aspmx.l.google.com,aspmx3.googlemail.com]; RCVD_IN_DNSWL_NONE(0.00)[e.2.4.0.0.0.0.0.0.0.0.0.0.0.0.0.0.2.0.0.4.6.8.4.0.b.8.f.7.0.6.2.list.dnswl.org : 127.0.5.0]; NEURAL_HAM_SHORT(-0.91)[-0.907,0]; IP_SCORE(-2.88)[ip: (-9.40), ipnet: 2607:f8b0::/32(-2.81), asn: 15169(-2.12), country: US(-0.07)]; FROM_EQ_ENVFROM(0.00)[]; MIME_TRACE(0.00)[0:+,1:+]; RCVD_TLS_LAST(0.00)[]; ASN(0.00)[asn:15169, ipnet:2607:f8b0::/32, country:US]; RCVD_COUNT_TWO(0.00)[2] Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Content-Filtered-By: Mailman/MimeDel 2.1.29 X-BeenThere: freebsd-virtualization@freebsd.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: "Discussion of various virtualization techniques FreeBSD supports." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 21 Mar 2019 01:47:44 -0000 On Tue, Mar 19, 2019 at 3:07 AM Patrick M. Hausen wrote: > Hi! > > > Am 19.03.2019 um 03:46 schrieb Victor Sudakov : > > 1. Does ARC actually cache zfs volumes (not files/datasets)? > > Yes it does. > > > 2. If ARC does cache volumes, does this cache make sense on a hyperviso= r, > > because guest OSes will probably have their own disk cache anyway. > > IMHO not much, because the guest OS is relying on the fact that when > it writes it=E2=80=99s own cached data out to =E2=80=9Edisk=E2=80=9C, it = will be committed to > stable storage. > I'd recommend caching at least metadata (primarycache=3Dmetadata). The gue= st will not cache zfs metadata and not having metadata in the cache can lead to a big hit in performance. The metadata in question here are things like block pointers that keep track of where the data is at - zfs can't find the data without metadata. I think the key decision as to whether you use primarycache=3Dmetadata or primarycache=3Dall comes down to whether you are after predictable performance or optimal performance. You will likely get worse performance with primarycache=3Dmetaadata (or especially with primarycache=3Dnone), presuming the host has RAM to spare. As you pack the system with more VMs or allocate more disk to existing VMs, you will probably find that primarycache=3Dmetadata leads steadier performance regardless of how much storage is used or how active other VMs are. > > 3. Would it make sense to limit vfs.zfs.arc_max to 1/8 or even less of > > total RAM, so that most RAM is available to guest machines? > > Yes if you build your own solution on plain FreeBSD. No if you are runnin= g > FreeNAS which already tries to autotune the ARC size according to the > memory committed to VMs. > > > 4. What other zfs tuning measures can you suggest for a bhyve > > hypervisor? > > e.g. > zfs set sync=3Dalways zfs/vm > > if zfs/vm is the dataset under which you create the ZVOLs for your emulat= ed > disks. > I'm not sure what the state of this is in FreeBSD, but in SmartOS we allow the guests to benefit from write caching if they negotiate FLUSH. Guests that do negotiate flush are expected to use proper barriers to flush the cache at critical times. When a FLUSH arrives, SmartOS bhyve issues an fsync(). To be clear - SmartOS bhyve is not actually caching writes in memory, it is just delaying transaction group commits. This avoids significant write inflation and associated latency. Support for FLUSH negotiation has greatly improved I/O performance - to the point that some tests show parity with running directly on the host pool. If not already in FreeBSD, this would probably be something of relatively high value to pull in. If you do go the route of sync=3Dalways and primarycache=3D,= be sure your guest block size and host volblocksize match. ZFS (on platforms I'm more familiar with, at least) defaults to volblocksize=3D8k. Most gues= t file systems these days seem to default to a block size of 4 KiB. If the guest file system issues a 4 KiB aligned write, that will turn into a read-modify-write cycle to stitch that 4 KiB guest block into the host's 8 KB block. If the adjacent guest block that is in the same 8 KiB host block is written in the next write, it will also turn into a read-modify-write cycle. If you are using ZFS in the guest, this can be particularly problematic because the guest ZFS will align writes with the guest pool's ashift, not with a guest dataset's recordsize or volblocksize. I discovered this in an extended benchmarking of zfs-on-zfs - primarily with primarycache=3Dmetadat= a and sync=3Dalways. The write inflation was quite significant: 3x was common. I tracked some of this down to alignment issues and part of it was due to sync writes causing the data to be written twice. George Wilson has a great talk where he describes the same issues I hit. https://www.youtube.com/watch?v=3D_-QAnKtIbGc I've mentioned write inflation related to sync writes a few times. One point that I think is poorly understood is that when ZFS is rushed into committing a write with fsync or similar, the immediate write is of ZIL blocks to the intent log. The intent log can be on a separate disk (slog, log=3D) or it can be on the disks that hold the pool's data. When the intent log is on the data disks, the data is written to the same disks multiple times: once as ZIL blocks and once as data blocks. Between these writes there will be full-disk head movement as the uberblocks are updated at the beginning and end of the disk. What I say above is based on experience with kernel zones on Solaris and bhyve on SmartOS. There are enough similarities that I expect bhyve on FreeBSD will be the same, but FreeBSD may have some strange-to-me zfs caching changes. Regards, Mike