Date:      Thu, 17 Nov 2016 11:16:36 +0100
From:      Jan Bramkamp <crest@rlwinm.de>
To:        freebsd-emulation@freebsd.org
Subject:   Re: bhyve: zvols for guest disk - yes or no?
Message-ID:  <5be68f57-c9c5-7c20-f590-1beed55fd6bb@rlwinm.de>
In-Reply-To: <D5A6875B-A2AE-4DD9-B941-71146AEF2578@punkt.de>
References:  <D991D88D-1327-4580-B6E5-2D59338147C0@punkt.de> <b775f684-98a2-b929-2b13-9753c95fd4f2@rlwinm.de> <D5A6875B-A2AE-4DD9-B941-71146AEF2578@punkt.de>

On 16/11/2016 19:10, Patrick M. Hausen wrote:
>> Without ZFS you would require a reliable hardware RAID controller (if such a magical creature exists) instead (or build a software RAID1+0 from gmirror and gstripe). IMO money is better invested into more RAM keeping ZFS and the admin happy.
>
> And we always use geom_mirror with UFS ...

That would work, but I don't recommend it for new setups. ZFS offers a 
lot of operational flexibility which in my opinion is alone worth the 
overhead. Without ZFS you would have to use either large raw image 
files on UFS or fight with an old-fashioned volume manager.

> Thanks again, will go the ZFS route, set up the system with the
> 16 GB RAM it has, then upgrade to 32 in a week or two.
>
> The plan is to put around 10 VMs with 2-4 G of configured
> memory on that system.
>
> bhyve doesn't do page deduplication like ESXi does, yet - right?

Bhyve doesn't support page deduplication, but it also doesn't wire down 
guest memory unless you ask it to; note that you do have to wire down 
guest memory to use PCI passthrough. If I were picking VM hosts today I 
would go with LGA 2011-v3 boards with at least eight DDR4 slots per 
socket. Add some nice >= 2 TB NVMe SSDs and suddenly you're limited by 
CPU cycles and storage space instead of IOPS and RAM.
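
For example (just a sketch: slot numbers, the tap interface, the zvol 
path and the VM name are made up, and the boot loader step is omitted), 
the difference is a single flag:

    # Without -S guest memory is not wired; only pages the guest
    # actually touches consume host RAM.
    bhyve -c 2 -m 4G -A -H -P \
        -s 0,hostbridge -s 1,lpc \
        -s 2,ahci-hd,/dev/zvol/tank/vm/guest0-disk0 \
        -s 3,virtio-net,tap0 \
        -l com1,stdio guest0

    # With PCI passthrough all 4G must be wired up front via -S.
    bhyve -c 2 -m 4G -A -H -P -S \
        -s 0,hostbridge -s 1,lpc \
        -s 2,ahci-hd,/dev/zvol/tank/vm/guest0-disk0 \
        -s 3,virtio-net,tap0 \
        -s 4,passthru,2/0/0 \
        -l com1,stdio guest0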

Use the bhyve AHCI emulation instead of virtio block devices, because 
the emulated SATA disks support TRIM, which translates into BIO_DELETE 
on the ZVOLs, releasing the backing storage while keeping whatever 
reservations are in place. This reduces pressure on the ZFS allocator 
and shortens backup/replication times.
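
In bhyve slot terms it's just the device model chosen for the disk 
slot (zvol path invented for illustration); after the guest issues 
TRIM (e.g. fstrim on a Linux guest, or UFS mounted with the trim 
option) you can watch the space come back on the host:

    # no TRIM support through virtio-blk (at the time of writing)
    -s 2,virtio-blk,/dev/zvol/tank/vm/guest0-disk0

    # emulated SATA disk, guest TRIM becomes BIO_DELETE on the zvol
    -s 2,ahci-hd,/dev/zvol/tank/vm/guest0-disk0

    # on the host, before/after the guest trims
    zfs get used,refreservation,volsize tank/vm/guest0-disk0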

Another thing I learned the hard way is that ZVOL properties like the 
block size are set in stone at creation time. You have to (cam)dd 
everything over to a new ZVOL to change the block size. The default 
ZVOL block size is 8K, which isn't wrong, but your guests need to align 
their file systems (and swap) correctly or you'll suffer from write 
amplification. And ZFS RAID-Z really sucks for such small block sizes. 
Use mirrored VDEVs in your pools or you will suffer from massive 
metadata overhead and disappointing IOPS.
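
Roughly like this (pool and dataset names are placeholders):

    # pick the block size at creation, e.g. 16K instead of the 8K default
    zfs create -V 40G -o volblocksize=16K tank/vm/guest0-disk0

    # volblocksize can't be changed later; copy to a new zvol instead
    zfs create -V 40G -o volblocksize=16K tank/vm/guest0-disk0-new
    dd if=/dev/zvol/tank/vm/guest0-disk0 \
       of=/dev/zvol/tank/vm/guest0-disk0-new bs=1m

    # mirrored vdevs instead of RAID-Z for VM storage
    zpool create tank mirror da0 da1 mirror da2 da3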

Finally, if your ZFS remote replication seems too slow, the cause is 
probably a bursty stream being forced through a TCP connection without 
enough buffering. ZFS send doesn't buffer the replication stream 
internally and blocks on writes to stdout, letting the source zpool 
idle. ZFS receive blocks on reads from stdin and on writes to the 
destination pool. If either ZFS send or ZFS receive blocks on disk I/O, 
the other will block very soon as well. Something like the misc/buffer 
port with a few hundred MiB of memory buffering in 128 KiB blocks on 
each side solves this problem. Sometimes with this optimization ZFS 
replications are *too* fast and starve normal workloads of disk or 
network I/O bandwidth.
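
Something along these lines (snapshot name, host and buffer sizes are 
placeholders; check the buffer(1) flags against the port's man page):

    zfs send -R tank/vm@backup \
        | buffer -s 128k -m 512m \
        | ssh backuphost 'buffer -s 128k -m 512m | zfs recv -du backup/vm'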

-- Jan Bramkamp


