Date:      Tue, 03 May 2011 23:14:28 -0700
From:      Ted Mittelstaedt <tedm@mittelstaedt.us>
To:        freebsd-emulation@freebsd.org
Subject:   Re: virtualbox I/O 3 times slower than KVM?
Message-ID:  <4DC0EEC4.8030507@mittelstaedt.us>
In-Reply-To: <32257637.1304487290435.JavaMail.root@mswamui-blood.atl.sa.earthlink.net>
References:  <32257637.1304487290435.JavaMail.root@mswamui-blood.atl.sa.earthlink.net>

On 5/3/2011 10:34 PM, John wrote:
>
> -----Original Message-----
>> From: Ted Mittelstaedt <tedm@mittelstaedt.us>
>> Sent: May 4, 2011 12:48 AM
>> To: freebsd-emulation@freebsd.org
>> Subject: Re: virtualbox I/O 3 times slower than KVM?
>>
>> On 5/3/2011 11:25 AM, John wrote:
>>>
>>> -----Original Message-----
>>>> From: Ted Mittelstaedt <tedm@mittelstaedt.us>
>>>> Sent: May 3, 2011 12:02 AM
>>>> To: Adam Vande More <amvandemore@gmail.com>
>>>> Cc: freebsd-emulation@freebsd.org
>>>> Subject: Re: virtualbox I/O 3 times slower than KVM?
>>>>
>>>> On 5/2/2011 7:39 PM, Adam Vande More wrote:
>>>>> On Mon, May 2, 2011 at 4:30 PM, Ted Mittelstaedt
>>>>> <tedm@mittelstaedt.us> wrote:
>>>>>
>>>>> that's sync within the VM.  Where is the bottleneck taking
>>>>> place?  If the bottleneck is hypervisor to host, then the
>>>>> guest's write may put all its data into a memory buffer in
>>>>> the hypervisor, which then writes it out to the filesystem
>>>>> more slowly.  In that case killing the guest without killing
>>>>> the VM manager will allow the buffer to finish emptying,
>>>>> since the hypervisor isn't actually being shut down.
>>>>>
>>>>>
>>>>> No, the bottleneck is the emulated hardware inside the VM
>>>>> process container.  This is easy to observe: just start a
>>>>> bound process in the VM and watch top on the host side.  Also,
>>>>> the hypervisor uses the native host I/O driver, so there's no
>>>>> reason for it to be slow.  Since it's the emulated NIC which
>>>>> is the bottleneck, there is nothing left to issue the write.
>>>>> Further empirical evidence for this can be seen by watching
>>>>> gstat on a VM running with md- or ZVOL-backed storage.  I
>>>>> already use ZVOLs for this, so it was pretty easy to confirm
>>>>> that no I/O occurs when the VM is paused or shut down.
>>>>>
>>>>> Is his app ever going to face the extremely bad scenario,
>>>>> though?
>>>>>
>>>>>
>>>>> The point is that it should be relatively easy to induce the
>>>>> patterns you expect to see in production.  If you can't, I
>>>>> would consider that a problem.  Testing out theories
>>>>> (performance-based or otherwise) on a production system is not
>>>>> a good way to keep the continued faith of your clients when
>>>>> the production system is a mission-critical one.  Maybe
>>>>> throwing more hardware at a problem is the first line of
>>>>> defense for some companies; unfortunately I don't work for
>>>>> them.  Are they hiring? ;)  I understand the logic of such an
>>>>> approach and have even argued for it occasionally.
>>>>> Unfortunately payroll is already in the budget; extra hardware
>>>>> is not, even if it would be a net savings.
>>>>>
>>>>
>>>> Most if not all sites I've ever been in that run Windows
>>>> servers behave in this manner.  At most of these sites the SOP
>>>> is to "prove" that the existing hardware is inadequate by
>>>> loading whatever Windows software management wants loaded and
>>>> then letting the users on the network scream about it.  Then
>>>> money magically frees itself up when there wasn't any before,
>>>> since of course management will never blame the OS for the
>>>> slowness, always the hardware.
>>>>
>>>> Understand I'm not advocating this, just making an
>>>> observation.
>>>>
>>>> Understand that I'm not against testing but I've seen people
>>>> get so engrossed in spending time constructing test suites that
>>>> they have ended up wasting a lot of money.  I would have to
>>>> ask, how much time did the OP who started this thread take
>>>> building 2 systems, a Linux and a BSD system?  How much time
>>>> has he spent trying to get the BSD system to "work as well as
>>>> the Linux" system?  Wouldn't it have been cheaper for him to
>>>> not spend that time and just put the Linux system into
>>>> production?
>>>>
>>>> Ted
>>>
>>> Thanks a lot for everyone's insights and suggestions.  The
>>> CentOS on the KVM is a production server, so I took some time to
>>> prepare another CentOS guest on that KVM host and did the test
>>> as Ted suggested before (for comparison, right now the test
>>> FreeBSD is the only guest on the VirtualBox).
>>>
>>> What I do is cat the 330MB binary file (an XP service pack from
>>> Microsoft) 20 times into a single 6.6GB file, run "date" before
>>> and afterwards, and right after the second date finishes, force
>>> a power-off of the VM.  There are two observations:
>>>
>>> 1. The time to complete copying into this 6.6GB file was 72s,
>>> 44s, and 79s in three runs, presumably varying because there is
>>> another production VM on the same host.  The average is 65s, so
>>> it's about 100MB/s.
>>>
>>> 2. After the immediate power-off, I did find the resulting file
>>> was less than 6.6GB.  So indeed the VM claimed the copy was
>>> complete before it actually was.
>>>
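>>> (In shell terms the test was something like the following - the
>>> file names here are just placeholders:
>>>
>>>     date
>>>     i=1
>>>     while [ $i -le 20 ]; do cat xpsp3.exe >> bigfile; i=$((i+1)); done
>>>     date
>>>
>>> followed by a forced power-off of the VM right after the second
>>> date printed.)
>>>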
>>
>> For clarity, what you're saying is that the CentOS guest OS
>> claimed the copy had completed before it actually did, correct?
>> This is consistent with async-mounted filesystems, which I believe
>> are the default under CentOS.  Your guest is async-mounting its
>> own filesystem inside the VM.  So when the copy completes and you
>> get back to the shell prompt on the guest, a memory buffer in the
>> guest OS is still copying the last bits of the file to the disk.
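>>
>> (You can check what a guest actually mounted from inside it: on
>> CentOS,
>>
>>     cat /proc/mounts
>>
>> shows the options in effect for each filesystem, and on FreeBSD a
>> plain
>>
>>     mount
>>
>> lists each filesystem with its options in parentheses.)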
>>
>>> I then did the same thing on the VirtualBox.  Since I don't want
>>> the premature I/O completion described above, I made sure "Use
>>> Host I/O Cache" is unchecked for the VM's storage.
>>>
>>
>> That setting isn't going to change how the guest async-mounts its
>> filesystems.  All it does is force the hypervisor not to use the
>> caching that the host OS provides to it.
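>>
>> (For reference, the same setting can also be flipped from the
>> command line - at least on the VirtualBox versions I have in
>> front of me - with something like
>>
>>     VBoxManage storagectl "guestname" --name "SATA Controller" --hostiocache off
>>
>> where the VM name and controller name are whatever yours happen
>> to be called.)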
>>
>>> 1. The time to complete copying into this 6.6GB file was 119s
>>> and 92s; the average is 105s, so the speed is 62MB/s.
>>>
>>> 2. After immediately "Reset"-ing the machine, it couldn't boot.
>>> Both times it asked me to run fsck on that partition (2.2T GPT).
>>> But after finally powering up, I found the file was also less
>>> than 6.6GB both times.
>>>
>>
>> I would imagine this would happen.
>>
>>> So it looks like VirtualBox also suffers from a caching problem?
>>> Or did I do something wrong?
>>>
>>
>> There isn't a "caching problem".  As we have said on this forum,
>> the speed at which the actual write happens is the same under the
>> FreeBSD guest and the CentOS guest.  The only difference is that
>> the FreeBSD guest is sync-mounting its filesystem within the
>> virtual machine and the CentOS guest is async-mounting its
>> filesystem within the virtual machine.
>>
>> An async mount is always faster for writes because what is
>> actually going on is that the write goes to a memory buffer, and
>> the OS then completes the write "behind the scenes".  In many
>> cases, when the data in a file is changing rapidly, the write may
>> never go to disk at all: if the OS sees successive writes to the
>> same part of the file it will simply make the writes to the
>> memory buffer and then get around to updating the disk when it
>> feels like it.
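>>
>> (This is also why, on an async mount, if you need a particular
>> write to really be on disk you have to ask for it explicitly.  A
>> rough shell-level equivalent, with a made-up file name, is
>>
>>     cp bigfile /var/mail/bigfile && sync
>>
>> where the sync tells the kernel to start pushing its dirty
>> buffers out to disk.)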
>>
>>> I didn't spend extra time optimizing either the Linux or the
>>> FreeBSD; they are both stock production releases of CentOS and
>>> FreeBSD.  I just want to have a production-quality system
>>> without too much customization work.
>>>
>>> Also, most servers will be mail servers and web servers, with
>>> some database use.  Granted, copying a 6.6GB file is atypical on
>>> these servers, but I just want to get an idea of what the server
>>> is capable of.  I do not know of test software that can
>>> benchmark my usage pattern and is readily available on both
>>> CentOS and FreeBSD.
>>>
>>
>> What it really sounds like to me is that you're just not
>> understanding the difference in how the filesystems are mounted.
>> For starters you have your host OS, which the hypervisor is
>> running on.  You have a large file on that host which comprises
>> the VM, either FreeBSD or CentOS.  When the FreeBSD or CentOS
>> guest is making its writes, it is making them into that large
>> file.  If the host has the filesystem holding that file
>> sync-mounted, then that will slow the hypervisor's access to the
>> file.
>>
>> And then you have the guest OSes, which have their own memory
>> buffers and mount chunks of that file as their filesystems.  They
>> can mount those chunks sync or async.  If they mount them async,
>> then access to those chunks is faster as well.
>>
>> There is a tradeoff here.  If you sync-mount a filesystem and the
>> operating system halts or crashes, there is usually little to no
>> filesystem damage, but access to the disk will be slowest.  If
>> you async-mount a filesystem and the operating system crashes,
>> you will have a lot of garbage and file corruption, but access
>> will be the fastest.
>>
>> A very common configuration for a mailserver is, when you're
>> partitioning the filesystem, to create the usual /, swap, /usr,
>> /tmp, and /var - then create an additional /home and "mail".
>> Then you either mount "mail" on /var/mail, or you mount it on
>> /mail and softlink /var/mail to /mail.  Then you set up /tmp,
>> /home, and /mail or /var/mail as async mounts and everything else
>> as sync mounts, and softlink /var/spool to /tmp (a sketch of such
>> an fstab follows below).
>>
>> That way if the mailserver reboots or crashes, the program files
>> are generally not affected even if the e-mail is scotched, yet
>> you get the fastest possible disk performance.  If a partition is
>> so far gone that it cannot even be repaired by fsck, you can just
>> newfs it and start over.  It also makes a dump/restore backup
>> scheme a lot easier to set up.
>>
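>> A minimal sketch of what that /etc/fstab could look like - the
>> device names are just placeholders:
>>
>>     /dev/da0s1a   /        ufs    rw         1  1
>>     /dev/da0s1b   none     swap   sw         0  0
>>     /dev/da0s1d   /usr     ufs    rw         2  2
>>     /dev/da0s1e   /var     ufs    rw         2  2
>>     /dev/da0s1f   /tmp     ufs    rw,async   2  2
>>     /dev/da0s1g   /home    ufs    rw,async   2  2
>>     /dev/da0s1h   /mail    ufs    rw,async   2  2
>>
>> and then, after moving the original directories out of the way:
>>
>>     ln -s /mail /var/mail
>>     ln -s /tmp  /var/spool
>>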
>> With CentOS/Linux it's a bit different because that OS mounts the
>> entire disk on / and creates subdirectories for everything.  That
>> is one of the (many) reasons I don't ever use Linux for
>> mailservers: you do not have the same kind of fine-grained
>> control.  But you can create multiple partitions on CentOS, too.
>>
>> Also, the fact is that the FreeBSD filesystem and OS have been
>> heavily optimized, and if the mailserver isn't that busy you
>> don't need to bother async-mounting any of its partitions,
>> because the system will simply spawn more processes.  You have to
>> think of it this way: with a mailserver, let's say sync mounting
>> causes each piece of e-mail to spend 15 ms in disk access and
>> async mounting cuts that to 5 ms.  Well, if the mailserver
>> normally runs about 10 simultaneous sendmail instances under
>> async mounting, then it will run 30 instances under sync mounting
>> at the same throughput - and with each instance only taking 100MB
>> of RAM, you can toss a couple extra GB of RAM in the server and
>> forget about it.
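>>
>> (Rough arithmetic, treating disk time as the only cost: under
>> async, 10 instances x (1 message / 5 ms) is about 2000 messages
>> per second; to get the same ~2000 messages per second at 15 ms
>> each you need 30 instances, and those 20 extra instances at
>> ~100MB apiece are the couple of extra GB of RAM.)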
>>
>> Ted
>>
>
> Hi Ted,
>
> Thanks for taking time to explain this.  I'm so sorry I didn't pay
> attention to these (a)sync mounting options before.  Are you
> talking about the options in /etc/fstab?

Yes.

>  I just checked: I didn't give
> any options here (other than 'sw') for any disk partition on
> either the FreeBSD VirtualBox host or the guest.  And the mount
> manpage says that by default it's sync mounting.

Yes.  The absence of the keyword "async" means it's sync mounted.
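
For example, in /etc/fstab (the device name is just an example) this
line has no "async" keyword, so it is sync mounted:

    /dev/ad0s1d   /usr   ufs   rw         2   2

while this one is async mounted:

    /dev/ad0s1d   /usr   ufs   rw,async   2   2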

> Does that mean my FreeBSD guest
> is already sync mounted?

It means your guest has mounted its filesystems sync.

So what is happening is that the FreeBSD guest OS makes its writes
sync, and is told by the hypervisor (VirtualBox) that the write is
complete, when in reality all that has happened is that the guest OS
has completed a write to the virtual filesystem.  When you reset the
system, the write is still in flight from the hypervisor to the host
filesystem, and that is mounted async.

Here is what I mean:

FreeBSD guest
  sync-mounted on
the virtual filesystem controlled by the hypervisor, in memory
(the hypervisor's memory buffers)
  async-mounted on
the host OS filesystem
  which sits on the hardware cache of
the physical disk

The FreeBSD guest's disk I/O is virtual.
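
You can watch this from the host side (a rough sketch - use whatever
your actual disk device is): run the copy in the guest, and in
another terminal on the FreeBSD host run

    gstat

You should see writes to the backing disk continue for a while after
the copy inside the guest has already returned, and running

    sync

on the host asks the kernel to start flushing its dirty buffers out
to the disk.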

>  Then why did it also prematurely
> declare completion of the write, and then fail to boot?
>
> (At the same time, I did confirm that on CentOS the default mount
> option includes "async".)

There are in this scenario at least 5 layers of disk caching going on:

FreeBSD caching writes to its virtual filesystem.

The virtual filesystem the hypervisor provides is probably also
cached in the RAM that the host OS gives to the hypervisor, in a
hypervisor I/O cache.

Then the hypervisor's virtual filesystem is backed by a file on the
host's disk, so the host OS is running its own filesystem cache.

Then all of that sits on the hardware cache of the RAID controller.

And finally, each individual disk in the array has its own internal
cache.

The better hardware raid cards are battery-backed up for this reason.

All of this is why it's not easy to get the REAL disk throughput of
a system: there is so much caching.  The tools written to do this
have to do fancy stuff, like generating data in many files and
reading and writing them back and forth in order to fill up all of
the caches, so that the systems and disks are unable to cache
anything and have to do the actual writes.  The tools also have to
create and delete many of these files so the systems don't get smart
and just manipulate the files in a memory disk cache somewhere.
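
A crude way to get closer to the real number without a dedicated
benchmark tool (just a sketch - the size and path are examples):
write a file several times larger than the RAM of the guest and the
host combined, and time it together with a final flush, e.g. on the
FreeBSD guest:

    /usr/bin/time sh -c 'dd if=/dev/zero of=/tmp/big bs=1m count=16384 && sync'
    rm /tmp/big

16GB of zeroes will blow through a few GB worth of caches, so the
number you get is dominated by what actually had to reach the disk.
The usual tools (bonnie++, iozone, and the like, which are in ports
on FreeBSD and packaged for CentOS) do the same sort of thing more
carefully.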

Ted

> _______________________________________________
> freebsd-emulation@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-emulation
> To unsubscribe, send any mail to
> "freebsd-emulation-unsubscribe@freebsd.org"



