Date: Thu, 29 Feb 2024 00:02:45 +0300
From: Vitaliy Gusev <gusev.vitaliy@gmail.com>
To: Matthew Grooms <mgrooms@shrew.net>
Cc: virtualization@freebsd.org
Subject: Re: bhyve disk performance issue
Message-ID: <3850080E-EBD1-4414-9C4E-DD89611C9F58@gmail.com>
In-Reply-To: <b353b39a-56d3-4757-a607-3c612944b509@shrew.net>
References: <6a128904-a4c1-41ec-a83d-56da56871ceb@shrew.net>
 <28ea168c-1211-4104-b8b4-daed0e60950d@app.fastmail.com>
 <0ff6f30a-b53a-4d0f-ac21-eaf701d35d00@shrew.net>
 <6f6b71ac-2349-4045-9eaf-5c50d42b89be@shrew.net>
 <50614ea4-f0f9-44a2-b5e6-ebb33cfffbc4@shrew.net>
 <6a4e7e1d-cca5-45d4-a268-1805a15d9819@shrew.net>
 <f01a9bca-7023-40c0-93f2-8cdbe4cd8078@tubnor.net>
 <edb80fff-561b-4dc5-95ee-204e0c6d95df@shrew.net>
 <a07d070b-4dc1-40c9-bc80-163cd59a5bfc@Duedinghausen.eu>
 <e45c95df-4858-48aa-a274-ba1bf8e599d5@shrew.net>
 <BE794E98-7B69-4626-BB66-B56F23D6A67E@gmail.com>
 <25ddf43d-f700-4cb5-af2a-1fe669d1e24b@shrew.net>
 <1DAEB435-A613-4A04-B63F-D7AF7A0B7C0A@gmail.com>
 <b353b39a-56d3-4757-a607-3c612944b509@shrew.net>
> On 28 Feb 2024, at 23:03, Matthew Grooms <mgrooms@shrew.net> wrote:
> 
> ...
> The virtual disks were provisioned with either a 128G disk image or a 1TB raw partition, so I don't think space was an issue.
> 
> Trim is definitely not an issue. I'm using a tiny fraction of the 32TB array and have tried both heavily under-provisioned HW RAID10 and SW RAID10 using GEOM. The latter was tested after sending full trim resets to all drives individually.
> 

It could be that TRIM/UNMAP is not being used. In that case a zvol (for instance) fills up over time, ZFS considers all of its blocks to be in use, and write operations can suffer. I believe this was recently fixed.

Also look at this chain:

   GuestFS->UNMAP->bhyve->Host-FS->PhysicalDisk

The problem with UNMAP is that it can cause an unpredictable slowdown at any time, so I would suggest comparing results with UNMAP enabled and disabled in the guest.
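A quick way to see whether discard is in play inside a Linux guest is something like this ( a sketch -- the device name "vda" and the ext4 root filesystem are assumptions, adjust for your setup ):

   ~# lsblk --discard /dev/vda      # non-zero DISC-GRAN/DISC-MAX means the disk advertises TRIM/UNMAP
   ~# grep discard /proc/mounts     # is any filesystem mounted with online discard?
   ~# mount -o remount,nodiscard /  # ext4: disable online discard for a test run
   ~# fstrim -av                    # or trim in one controlled batch instead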
> I will try to incorporate the rest of your feedback into my next round of testing. If I can find a benchmark tool that works with a raw block device, that would be ideal.
> 

Use "dd" as a first step for read testing:

   ~# dd if=/dev/nvme0n2 of=/dev/null bs=1M status=progress iflag=direct
   ~# dd if=/dev/nvme0n2 of=/dev/null bs=1M status=progress

Compare the results with and without direct I/O.

Then the "fio" tool:

   1) write prepare:

   ~# fio --name=prep --rw=write --verify=crc32 --loops=1 --numjobs=2 --time_based --thread --bs=1M --iodepth=32 --ioengine=libaio --direct=1 --group_reporting --size=20G --filename=/dev/nvme0n2

   2) read test:

   ~# fio --name=readtest --rw=read --loops=30 --numjobs=2 --time_based --thread --bs=256K --iodepth=32 --ioengine=libaio --direct=1 --group_reporting --size=20G --filename=/dev/nvme0n2

--
Vitaliy

> Thanks,
> 
> -Matthew
> 
>> --
>> Vitaliy
>> 
>>> On 28 Feb 2024, at 21:29, Matthew Grooms <mgrooms@shrew.net> wrote:
>>> 
>>> On 2/27/24 04:21, Vitaliy Gusev wrote:
>>>> Hi,
>>>> 
>>>>> On 23 Feb 2024, at 18:37, Matthew Grooms <mgrooms@shrew.net> wrote:
>>>>> 
>>>>>> ...
>>>>> The problem occurs when an image file is used on either ZFS or UFS. The problem also occurs when the virtual disk is backed by a raw disk partition or a ZVOL. This issue isn't related to a specific underlying filesystem.
>>>>> 
>>>> Do I understand right that you ran the tests inside the guest VM on an ext4 filesystem? If so, you should be aware of the additional overhead compared to running the tests on the host.
>>>> 
>>> Hi Vitaliy,
>>> 
>>> I appreciate you providing the feedback and suggestions. I spent over a week trying as many combinations of host and guest options as possible to narrow this issue down to a specific host storage or guest device model option. Unfortunately the problem occurred with every combination I tested while running Linux as the guest. Note, I only tested RHEL8 & RHEL9 compatible distributions ( Alma & Rocky ). The problem did not occur when I ran FreeBSD as the guest. The problem did not occur when I ran KVM on the host and Linux as the guest.
>>> 
>>>> I would suggest running fio (or even dd) on the raw disk device inside the VM, i.e. without a filesystem at all. Just do not forget to do "echo 3 > /proc/sys/vm/drop_caches" in the Linux guest VM before you run tests.
>>> 
>>> The two servers I was using to test with are no longer available. However, I'll have two more identical servers arriving in the next week or so. I'll try to run additional tests and report back here. I used bonnie++ as that was easily installed from the package repos on all the systems I tested.
>>> 
>>>> Could you also give more information about:
>>>> 
>>>> 1. What results did you get (decode bonnie++ output)?
>>> 
>>> If you look back at this email thread, there are many examples of running bonnie++ on the guest. I first ran the tests on the host system using Linux + ext4 and FreeBSD 14 + UFS & ZFS to get a baseline of performance. Then I ran bonnie++ tests using bhyve as the hypervisor and Linux & FreeBSD as the guest. The combinations of host and guest storage options included ...
>>> 
>>> 1) block device + virtio blk
>>> 2) block device + nvme
>>> 3) UFS disk image + virtio blk
>>> 4) UFS disk image + nvme
>>> 5) ZFS disk image + virtio blk
>>> 6) ZFS disk image + nvme
>>> 7) ZVOL + virtio blk
>>> 8) ZVOL + nvme
>>> 
>>> In every instance, I observed the Linux guest disk IO often perform very well for some time after the guest was first booted. Then the performance of the guest would drop to a fraction of the original performance. The benchmark test was run every 5 or 10 minutes in a cron job. Sometimes the guest would perform well for up to an hour before performance would drop off. Most of the time it would only perform well for a few cycles ( 10 - 30 mins ) before performance would drop off. The only way to restore the performance was to reboot the guest. Once I determined that the problem was not specific to a particular host or guest storage option, I switched my testing to only use a block device as backing storage on the host to avoid hitting any system disk caches.
>>> 
>>> Here is the test script I used in the cron job ...
>>> 
>>> #!/bin/sh
>>> FNAME='output.txt'
>>> 
>>> echo ================================================================================ >> $FNAME
>>> echo Begin @ `/usr/bin/date` >> $FNAME
>>> echo >> $FNAME
>>> /usr/sbin/bonnie++ 2>&1 | /usr/bin/grep -v 'done\|,' >> $FNAME
>>> echo >> $FNAME
>>> echo End @ `/usr/bin/date` >> $FNAME
>>> 
>>> As you can see, I'm calling bonnie++ with the system defaults. That uses a data set size that's 2x the guest RAM in an attempt to minimize the effect of filesystem cache on results. Here is an example of the output that bonnie++ produces ...
>>> 
>>> Version 2.00        ------Sequential Output------ --Sequential Input- --Random-
>>>                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>>> Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>>> linux-blk    63640M  694k  99  1.6g  99  737m  76  985k  99  1.3g  69 +++++ +++
>>> Latency             11579us     535us   11889us    8597us   21819us    8238us
>>> Version 2.00        ------Sequential Create------ --------Random Create--------
>>> linux-blk           -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>>>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>>>                  16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
>>> Latency              7620us     126us    1648us     151us      15us     633us
>>> 
>>> --------------------------------- speed drop ---------------------------------
>>> 
>>> Version 2.00        ------Sequential Output------ --Sequential Input- --Random-
>>>                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
>>> Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>>> linux-blk    63640M  676k  99  451m  99  314m  93  951k  99  402m  99 15167 530
>>> Latency             11902us    8959us   24711us   10185us   20884us    5831us
>>> Version 2.00        ------Sequential Create------ --------Random Create--------
>>> linux-blk           -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>>>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>>>                  16     0  96 +++++ +++ +++++ +++     0  96 +++++ +++     0  75
>>> Latency               343us     165us    1636us     113us      55us    1836us
>>> 
>>> In the example above, the benchmark test repeated about 20 times with results similar to the performance shown above the dotted line ( ~ 1.6g/s seq write and 1.3g/s seq read ). After that, the performance dropped to what's shown below the dotted line, which is less than 1/4 the original speed ( ~ 451m/s seq write and 402m/s seq read ).
>>> 
>>>> 2. What results were you expecting?
>>>> 
>>> What I expect is that, when I perform the same test with the same parameters, the results would stay more or less consistent over time. This is true when KVM is used as the hypervisor on the same hardware and guest options. That said, I'm not worried about bhyve being consistently slower than KVM or a FreeBSD guest being consistently slower than a Linux guest. I'm concerned that the performance drop over time is indicative of an issue with how bhyve interacts with non-FreeBSD guests.
>>> 
>>>> 3. VM configuration, virtio-blk disk size, etc.
>>>> 4. Full command for tests (including size of test-set), bhyve, etc.
>>> 
>>> I believe this was answered above. Please let me know if you have additional questions.
>>> 
>>>> 5. Did you pass virtio-blk as 512 or 4K ? If 512, you should probably try 4K.
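>>>> For instance, a bhyve disk option along these lines ( the slot number and backing path are just placeholders ):
>>>> 
>>>>    -s 4,virtio-blk,/dev/da0p2,sectorsize=4096
>>>> 
>>>> or sectorsize=512/4096 to expose a 512-byte logical / 4K physical sector disk.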
>>> 
>>> The testing performed was not exclusively with virtio-blk.
>>> 
>>>> 6. Linux has several read-ahead options for the IO scheduler, and it could be related too.
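>>>> For example, something like this inside the guest ( a sketch -- "vda" stands in for whatever name the disk shows up under ):
>>>> 
>>>>    ~# cat /sys/block/vda/queue/scheduler          # show the active IO scheduler
>>>>    ~# echo none > /sys/block/vda/queue/scheduler  # bypass IO scheduling
>>>>    ~# echo 0 > /sys/block/vda/queue/read_ahead_kb # disable read-ahead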
>>> 
>>> I suppose it's possible that bhyve could be somehow causing the disk scheduler in the Linux guest to act differently. I'll see if I can figure out how to disable that in future tests.
>>> 
>>>> Additionally, could you also play with the "sync=disabled" volume/zvol option? Of course it is only for write testing.
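>>>> Something like ( the pool/volume name is a placeholder ):
>>>> 
>>>>    ~# zfs set sync=disabled tank/bhyve-vol
>>>>    ~# zfs set sync=standard tank/bhyve-vol   # restore the default afterwards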
>>> 
>>> The testing performed was not exclusively with zvols.
>>> 
>>> Once I have more hardware available, I'll try to report back with more testing. It may be interesting to also see how a Windows guest performs compared to Linux & FreeBSD. I suspect that this issue may only be triggered when a fast disk array is in use on the host. My tests used a 16x SSD RAID 10 array. It's also quite possible that the disk IO slowdown is only a symptom of another issue that's triggered by the disk IO test ( please see the end of my last post related to scheduler priority observations ). All I can say for sure is that ...
>>> 
>>> 1) There is a problem and it's reproducible across multiple hosts
>>> 2) It affects RHEL8 & RHEL9 guests but not FreeBSD guests
>>> 3) It is not specific to any host or guest storage option
>>> 
>>> Thanks,
>>> 
>>> -Matthew