From: Harry Schmalzbauer <freebsd@omnilan.de>
To: freebsd-virtualization@freebsd.org
Subject: bhyve win-guest benchmark comparing
Date: Mon, 22 Oct 2018 13:26:03 +0200
Organization: OmniLAN
Message-ID: <9e7f4c01-6cd1-4045-1a5b-69c804b3881b@omnilan.de>
Hello,

I started using bhyve for some of my local setups about one or two years ago. I'm utilizing rc.local along with some homebrew start_if.NG scripts to connect tap(4) and ng_bridge(4) with a single vlan(4) uplink child, so I know bhyve well enough to know that it isn't comparable in many ways with ESXi as a "product", and I'm completely fine with the extra jobs I have to do to use bhyve!

But I've always felt that there are significant performance penalties, which haven't been a big issue for my own guests (tinker and WSUS Windows). Since I wanted to evaluate the possibility of replacing ESXi instances elsewhere, I decided to run some hopefully meaningful benchmark tests. Unfortunately, the performance penalty is much too high. I'd like to share my measurements here.

Host-Config (the original ASCII diagram did not survive transit; the recoverable details):
- mps0 (vmhba0 under ESXi): da0 (database/benchmark disk) and da1 (guest system disk: Windows Server 2012R2, SQLExpress2017), backed by SSDs
- ahci0 (vmhba32 under ESXi): bhyve-ssd (UFS, 12.0-BETA1) and esxi-ssd (6.7), the two alternative host boot disks
- CPU/RAM: Xeon E3, 4x3.6GHz (hyperthreading enabled), 32GB

So the guest is booting from its own physical disk (single SSD via mps).

Guest-Config:
When the host was running FreeBSD, the relevant bhyve disk setup reads "-s 3,ahci,hd:/dev/da1,hd:/dev/da0". Likewise, when the host was running ESXi, the corresponding disks/vml.... were attached to the ESXi "SATA Controller" (via RDM). So in both cases the built-in generic guest-OS (Win2k12R2) AHCI driver was in use, for both the OS system disk and the db/benchmark disk. Both hypervisors assign 2 CPU cores (in one package) and 4GB RAM. The guest operating system of choice is Windows Server 2012R2.
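For context, the full bhyve invocation would have looked roughly like the following. This is a sketch reconstructed from the quoted "-s 3,ahci,hd:..." option and the 2-vCPU/4GB sizing; the UEFI firmware path, remaining slot numbers, tap name, and VM name are my assumptions, not details from the original setup:

```shell
# Hypothetical reconstruction; only the disk slot and the
# 2-vCPU/4GB sizing come from the text above.
bhyve -c 2 -m 4G -H -w \
  -s 0,hostbridge \
  -s 3,ahci,hd:/dev/da1,hd:/dev/da0 \
  -s 4,virtio-net,tap0 \
  -s 31,lpc \
  -l bootrom,/usr/local/share/uefi-firmware/BHYVE_UEFI.fd \
  win2k12r2

# The virtio-blk variant tried later would swap the disk slot, e.g.:
#   -s 3,virtio-blk,/dev/da1 -s 4,virtio-blk,/dev/da0
```

Windows guests require the UEFI bootrom (from the bhyve-firmware port); the generic AHCI driver then binds to the emulated "-s 3,ahci" controller just as it does to the ESXi "SATA Controller".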
As a real-world application I chose MS SQL Server Express 2017, simply because I looked for an "industry" benchmark tool and found a trial version that was easy to set up, bringing test data along with several workload templates. After the OS setup was done, all (G)UI actions were performed through an RDP session in both cases.

Test-Runs:
Each hypervisor had only the one bench guest running; no other tasks/guests were running besides the system's native standard processes. Since the time between powering up the guest and finishing logon differed notably (~5s vs. ~20s) from one host to the other, I did a quick synthetic IO test beforehand. I'm using IOmeter, since heise.de published a great test pattern called IOmix (about 18 years ago, I guess). This access pattern has always perfectly reflected system performance for human computer usage with non-calculation-centric applications, and it is still my favourite, even though throughput and latency have changed by some orders of magnitude during the last decade. (I had also defined something for "fio" which mimics IOmix and shows reasonable relational results, but I still prefer IOmeter for homogeneous IO benchmarking.)

The result is about a factor of 7 :-(
~3800 iops & 69 MB/s (guest CPU usage: 42% IOmeter + 12% irq)
vs.
~29000 iops & 530 MB/s (guest CPU usage: 11% IOmeter + 19% irq)
[With a debug kernel and debug malloc, the numbers are 3000 iops & 56 MB/s. Using virtio-blk instead of ahci,hd: results in 5660 iops & 104 MB/s with a non-debug kernel. Much better, but with even higher CPU load, and still a factor of 4 slower.]

What I don't understand is why the IOmeter process differs that much in CPU utilization. It's the same binary on the same OS (guest) with the same OS driver and the same underlying hardware; "just" the AHCI emulation and the vmm differ...

Unfortunately, the picture for virtio-net vs. vmxnet3 is similarly sad.
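For reference, the tap(4)/ng_bridge(4) plumbing mentioned at the top can be sketched as below. This is only a sketch of one common netgraph recipe; the node, hook, and interface names (tap0, vmbr0, vlan100) are my assumptions, since the original uses homebrew start_if.NG scripts:

```shell
# Assumed names: tap0 as the guest NIC backend, vlan100 as the uplink.
kldload ng_ether ng_bridge 2>/dev/null || true
ifconfig tap0 create up

# Hang an ng_bridge instance off the tap's lower hook and name it.
ngctl mkpeer tap0: bridge lower link0
ngctl name tap0:lower vmbr0

# Attach the vlan uplink's lower hook to the next bridge link.
ngctl connect vlan100: vmbr0: lower link1

# The guest side then uses e.g. "-s 4,virtio-net,tap0" on the
# bhyve command line, matching the virtio-net results discussed here.
```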
Copying a single 5GB file from a CIFS share to the DB SSD results in 100% guest CPU usage, of which 40% are irqs, and the throughput maxes out at ~40MB/s. When copying the same file from the same source with the same guest on the same host, but with the host booted into ESXi, there's 20% guest CPU usage while transferring 111MB/s, the GbE uplink limit.

These synthetic benchmarks explain very well the "feelable" difference when using a guest on the two hypervisors, though fortunately not by that factor most of the time. So I continued with the database test I had originally aimed for.

Disclaimer: I'm no database expert, and it's not about achieving maximum performance from the DB workload. It's just about generating reproducible CPU-bound load together with IO load to illustrate overall performance _differences_.

So I combined two "industry standard" benchmarks from "Benchmark Factory" and scaled them (TPC-C by 75 and TPC-H by 3) to generate a database 10GB in size.

Interestingly, the difference is by far not as big as expected after the previous results. There's clearly a difference, but the worst case isn't even a factor of 2. I did two consecutive runs for each hypervisor: Run4 and Run5 were on bhyve, Run6 and Run7 on ESXi. Please see the graph here:
http://www.schmalzbauer.de/downloads/sqlbench_bhyve-esxi.png

Even more interestingly, the disk load "graphs" looked very similar. I don't really have a graph for the bhyve run, but during it I saw 200-500MB/s transfer bandwidth, which is exactly what I see in the ESXi graph. So the bhyve setup is able to deliver constantly high performance in that case! But there's a variation which I don't understand: almost any other application suffers from disk IO constraints on bhyve. Of course, block size is the most important parameter here, but MSSQL doesn't use big block sizes as far as I know (formerly these were 8k, and then, since 2010 I guess, 64k).
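To isolate the block-size effect just mentioned, one could run fio directly against the benchmark disk at MSSQL's typical IO sizes. This is only a sketch; the device path, read/write mix, and queue depth are my assumptions and not the author's IOmix definition:

```shell
# Mixed random IO at 8k and 64k against the DB disk (assumed /dev/da0).
# WARNING: writes directly to the device; only run on a scratch disk.
for bs in 8k 64k; do
  fio --name=mssql-$bs --filename=/dev/da0 --direct=1 \
      --ioengine=posixaio --rw=randrw --rwmixread=70 \
      --bs=$bs --iodepth=16 --runtime=60 --time_based
done
```

Comparing the two block sizes on both hypervisors would show whether the bhyve AHCI/virtio-blk penalty shrinks at 64k, which would explain why the MSSQL runs suffer less than the IOmix-style small-block workloads.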
This result perfectly reflects my observation with my local WSUS, which is also database load, and I never found performance to be an issue on that guest.

I have another picture comparing pure synthetic benchmarks, showing only smaller "FPU/ALU" differences (memory bandwidth measured with Intel's mlc was exactly the same) but huge disk IO differences, although I used virtio-blk instead of ahci,hd: for bhyve (where the HDD selection shows "Red Hat VirtIO"):
http://www.schmalzbauer.de/downloads/sbmk_bhyve-esxi.png

Question:
Are these (emulation-related, I guess) performance issues well known? I mean, does somebody know what needs to be done in which area in order to catch up with the other results, so that it's just a matter of time/resources? Or are these results surprising, so that extensive analysis must be done before anybody can tell how to fix the IO limitations?

Is the root cause of the problematically low virtio-net throughput perhaps the same as that of the disk IO limits? Both really hurt in my use case, and the host is not idling in proportion; it even shows higher load with lower results. So even if the lower user-experience performance were considered tolerable, the guest-per-host density would only be half.

Thanks,

-harry