Date: Thu, 17 Nov 2016 11:16:36 +0100
From: Jan Bramkamp <crest@rlwinm.de>
To: freebsd-emulation@freebsd.org
Subject: Re: bhyve: zvols for guest disk - yes or no?
Message-ID: <5be68f57-c9c5-7c20-f590-1beed55fd6bb@rlwinm.de>
In-Reply-To: <D5A6875B-A2AE-4DD9-B941-71146AEF2578@punkt.de>
References: <D991D88D-1327-4580-B6E5-2D59338147C0@punkt.de>
 <b775f684-98a2-b929-2b13-9753c95fd4f2@rlwinm.de>
 <D5A6875B-A2AE-4DD9-B941-71146AEF2578@punkt.de>
On 16/11/2016 19:10, Patrick M. Hausen wrote:
>> Without ZFS you would require a reliable hardware RAID controller (if
>> such a magical creature exists) instead (or build a software RAID1+0
>> from gmirror and gstripe). IMO money is better invested into more RAM
>> keeping ZFS and the admin happy.
>
> And we always use geom_mirror with UFS ...

That would work, but I don't recommend it for new setups. ZFS offers a
lot of operational flexibility which in my opinion is alone worth the
overhead. Without ZFS you would have to use either large raw image
files on UFS or fight with an old-fashioned volume manager.

> Thanks again, will go the ZFS route, set up the system with the
> 16 GB RAM it has, then upgrade to 32 in a week or two.
>
> The plan is to put around 10 VMs with 2-4 G of configured
> memory on that system.
>
> bhyve doesn't do page deduplication like ESXi does, yet - right?

Bhyve doesn't support page deduplication, but it also doesn't wire down
guest memory unless you ask it to. You do, however, have to wire down
guest memory to use PCI passthrough.

If I were picking VM hosts today I would go with LGA 2011 v3 boards
with at least eight DDR4 slots per socket. Maybe add some nice >= 2 TB
NVMe SSDs and suddenly you're limited by CPU cycles and storage space
instead of IOPS and RAM.

Use the bhyve AHCI emulation instead of virtio block devices, because
the emulated SATA disks support TRIM, which translates into BIO_DELETE
on the ZVOLs, releasing the backing storage while keeping whatever
reservations are in place. This reduces pressure on the ZFS allocator
and shortens backup/replication times.

Another thing I learned the hard way is that a ZVOL's block size is set
in stone at creation time; you have to (cam)dd everything over to a new
ZVOL to change it. The default ZVOL block size is 8K, which isn't
wrong, but your guests need to align their file systems (and swap)
correctly or you'll suffer from write amplification. And ZFS RAID-Z
really sucks for such small block sizes. Use mirrored VDEVs in your
pools or you will suffer from massive metadata overhead and
disappointing IOPS.

Finally, if your ZFS remote replication seems too slow, the cause is
probably a bursty stream being forced through a TCP connection without
enough buffering. zfs send doesn't buffer the replication stream
internally and blocks on writes to stdout, letting the source zpool
idle; zfs receive blocks on reads from stdin and on writes to the
destination pool. If either of them blocks on disk I/O, the other will
block very soon as well. Something like the misc/buffer port with a few
hundred MiB of memory buffer (in 128 KiB blocks) on each side solves
this problem. Sometimes with this optimization ZFS replications are
*too* fast and starve normal workloads of disk or network I/O
bandwidth.
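To illustrate the wired-memory point above, a rough sketch of a bhyve
invocation with PCI passthrough. The bus/slot/function 2/0/0, the slot
numbers and the VM name are made up; check bhyve(8) and vmm(4) for the
exact flags on your version, and load the guest with bhyveload(8) or
grub-bhyve first as usual.

    # /boot/loader.conf: hand the device at bus/slot/function 2/0/0
    # (placeholder address) to the ppt driver at boot
    pptdevs="2/0/0"

    # -S wires all guest memory, which PCI passthrough requires
    bhyve -c 2 -m 4G -S -H -P \
        -s 0,hostbridge \
        -s 5,passthru,2/0/0 \
        -s 31,lpc -l com1,stdio \
        guest0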
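The TRIM point above is just a matter of which device emulation you put
on the bhyve -s slot. The slot number and zvol path below are
placeholders:

    # emulated SATA disk: TRIM from the guest ends up as BIO_DELETE
    # on the backing zvol
    -s 4,ahci-hd,/dev/zvol/tank/vm/guest0-disk0

    # virtio block device: no TRIM (as of this writing), so blocks
    # freed inside the guest stay allocated on the zvol
    -s 4,virtio-blk,/dev/zvol/tank/vm/guest0-disk0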
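For the block-size point above: volblocksize can only be set when the
ZVOL is created, so changing it later means a second ZVOL and a full
copy. Pool and dataset names and sizes below are made up:

    # pick the block size up front
    zfs create -V 32G -o volblocksize=8k tank/vm/guest0-disk0

    # changing it later: new zvol plus a copy, e.g. with dd or camdd
    zfs create -V 32G -o volblocksize=16k tank/vm/guest0-disk0.new
    dd if=/dev/zvol/tank/vm/guest0-disk0 \
       of=/dev/zvol/tank/vm/guest0-disk0.new bs=1m

    # mirrored vdevs instead of RAID-Z for small-block zvol workloads
    zpool create tank mirror da0 da1 mirror da2 da3 mirror da4 da5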
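And for the replication point above, a sketch of a buffered
send/receive pipeline using the misc/buffer port. The snapshot name,
buffer sizes, destination host and pool are made up, and the exact
buffer(1) flags may differ on your install, so check the man page:

    zfs send -R tank/vm@2016-11-17 | buffer -m 512m -s 128k | \
        ssh backuphost 'buffer -m 512m -s 128k | zfs receive -d backup'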
--
Jan Bramkamp