Date: Tue, 11 Apr 2017 15:10:31 -0700 (PDT)
From: Chris Torek <torek@elf.torek.net>
Message-Id: <201704112210.v3BMAVSe093702@elf.torek.net>
To: f.v.anton@gmail.com, freebsd-hackers@freebsd.org
Subject: Re: On COW memory mapping in d_mmap_single
In-Reply-To: <CANXdjjZrjxhbqhZ13sAuZP7cqpvYU8CJusQ2NEpGuRCVMgr0=g@mail.gmail.com>

>Yes, all vCPUs are locked before calling mmap(). I agree that we don't
>need 'COW', as long as we keep all vCPUs locked while we copy the
>entire VM memory. But this might take a while, imagine a VM with 32GB
>or more of RAM. This will take maybe minutes to write to disk, so we
>don't actually want the VM to be frozen for so long. That's the
>reason we'd like to map the memory COW and then unlock vCPUs.

You'll need to save the device state while holding the CPUs locked,
too, so that the virtio queues can be in sync when you restore.


>It's an OBJT_DEFAULT. It's not a device object, it's the memory object
>given to guest to use as physical memory.

Your copy code path is basically a simplified vm_map_copy_entry()
as called from vmspace_fork() for the VM_INHERIT_COPY case.  But if
these are OBJT_DEFAULT, shouldn't you be calling vm_object_collapse()?
See https://github.com/flaviusanton/freebsd/blob/bhyve-save-restore/sys/vm/vm_map.c#L3170
(Maybe src_object->handle is never NULL?  There are several things
in the VM object code that I do not understand fully here, so this
might be the case.)

>>Next, how do you undo the damage done by your 'COW' ?

>This is one thing that we've thought about, but we don't have a
>solution for now. I agree it is very important, though. I figured that
>it might be possible to 'unmark' the memory object as COW with some
>additional tricks.

I think you may be better off doing actual vm_map_copy_entry()
calls.

I am assuming, here, that snapshot-saving is implemented by
sending a request to the running bhyve, which spins off a thread
or process that does the snapshot-save.  If you spin it off as
a real process, i.e., do a fork(), you will get the existing
VM system to do all the work for you.  The overall strategy
then looks something like this:

    handle_external_suspend_or_snapshot_request() {
        set global suspending flag /* if needed */
        stop all vcpus
        signal virtio and emulated devices to quiesce, if needed
        if (snapshot) {
            open snapshot file
            pid = fork()
            if (pid == 0) { /* child */
                COW is now in effect on memory: save more-volatile
                    vcpu and dev state
                notify parent (e.g., via a pipe) that it's safe to resume
                save RAM state
                close snapshot file
                _exit(0)
            }
            if (pid < 0) ... handle error ...
            /* parent */
            close snapshot file
            wait for child to signal OK to resume
        } else {
            wait for external resume signal
        }
        clear suspending flag
        resume devices and vcpus
    }
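
For what it's worth, here is a minimal userland sketch of the fork()
strategy above, in C.  It is not bhyve code: the names snapshot_fork()
and snapshot_demo() are made up for illustration, a plain char buffer
stands in for guest RAM, a pipe stands in for the "OK to resume"
notification, and saving vcpu/device state is only marked by a comment.
It also assumes the memory is a private mapping, since only private
mappings become copy-on-write across fork().

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: snapshot len bytes at ram into path.
 * Returns the child's pid once the child reports that it is safe to
 * resume, or -1 on error.  The caller must waitpid() on that pid.
 */
pid_t
snapshot_fork(const char *ram, size_t len, const char *path)
{
    int fd, p[2];
    pid_t pid;
    char ok;

    assert(ram != NULL && path != NULL);
    if ((fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600)) < 0)
        return (-1);
    if (pipe(p) < 0) {
        close(fd);
        return (-1);
    }
    pid = fork();
    if (pid == 0) {
        /* child: COW is now in effect on our view of ram */
        close(p[0]);
        /* (more-volatile vcpu and device state would be saved here) */
        if (write(p[1], "k", 1) != 1)   /* tell parent: safe to resume */
            _exit(1);
        close(p[1]);
        while (len > 0) {               /* save RAM state */
            ssize_t n = write(fd, ram, len);
            if (n <= 0)
                _exit(1);
            ram += n;
            len -= (size_t)n;
        }
        close(fd);
        _exit(0);
    }
    /* parent, or fork() failure */
    close(fd);
    close(p[1]);
    if (pid > 0 && read(p[0], &ok, 1) != 1) {
        /* child died before notifying us: reap it, report failure */
        (void)waitpid(pid, NULL, 0);
        pid = -1;
    }
    close(p[0]);
    return (pid);
}

/* Returns 0 if the snapshot file holds the pre-resume contents. */
int
snapshot_demo(void)
{
    static char ram[1 << 16];
    char path[] = "/tmp/snapdemo.XXXXXX";
    char buf[4096];
    size_t total = 0;
    ssize_t n, i;
    int fd, status, good = 1;
    pid_t pid;

    memset(ram, 'A', sizeof(ram));
    if ((fd = mkstemp(path)) < 0)
        return (-1);
    close(fd);
    if ((pid = snapshot_fork(ram, sizeof(ram), path)) < 0)
        return (-1);
    memset(ram, 'B', sizeof(ram));      /* "guest" resumes, dirties RAM */
    if (waitpid(pid, &status, 0) != pid ||
        !WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return (-1);
    if ((fd = open(path, O_RDONLY)) < 0)
        return (-1);
    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        for (i = 0; i < n; i++)
            if (buf[i] != 'A')
                good = 0;
        total += (size_t)n;
    }
    close(fd);
    unlink(path);
    return (good && total == sizeof(ram) ? 0 : -1);
}
```

The parent overwrites the buffer as soon as the child says it may
resume, yet the file should still contain the old contents, because
the child is writing from its own copy-on-write view of the pages.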

To resume a snapshot from a file, we load its state and then run
the last two steps (clear suspending flag and resume devices and
vcpus).

This way all the COW action happens through fork(), so there is no
new kernel-side code required.

(Frankly, I think the hard part here is saving device and virtual
APIC state.  If you have the vlapic state saving working, you have
made pretty good progress.)

Chris