From: Flavius Anton <f.v.anton@gmail.com>
Date: Wed, 12 Apr 2017 14:11:30 +0300
Subject: Re: On COW memory mapping in d_mmap_single
To: Chris Torek, freebsd-hackers@freebsd.org
Cc: Peter Grehan
List-Id: Technical Discussions relating to FreeBSD

Hi Chris,

Thanks a lot for your answer. I've added Peter to CC, as he knows about this ongoing project; some of the design decisions, such as the COW mapping, had already been made to some extent when I joined. Please see my inline answers below.

On Wed, Apr 12, 2017 at 1:10 AM, Chris Torek wrote:

>> Yes, all vCPUs are locked before calling mmap(). I agree that we don't
>> need COW as long as we keep all vCPUs locked while we copy the entire
>> VM memory. But that might take a while: imagine a VM with 32 GB or
>> more of RAM. Writing that to disk could take minutes, and we don't
>> want the VM to be frozen for that long. That's the reason we'd like to
>> map the memory COW and then unlock the vCPUs.
>
> You'll need to save the device state while holding the CPUs locked,
> too, so that the virtio queues can be in sync when you restore.

Yes, saving the vCPU state, vlapic, ioapic, etc. is done with all vCPUs locked. Memory, on the other hand, may be too large and take too much time to copy. I am working right now on saving the virtio queues and device state.
>> It's an OBJT_DEFAULT object. It's not a device object; it's the memory
>> object given to the guest to use as physical memory.
>
> Your copy code path is basically a simplified vm_map_copy_entry()
> as called from vmspace_fork() for the MAP_INHERIT case. But if
> these are OBJT_DEFAULT, shouldn't you be calling vm_object_collapse()?
> See https://github.com/flaviusanton/freebsd/blob/bhyve-save-restore/sys/vm/vm_map.c#L3170
> (Maybe src_object->handle is never NULL? There are several things
> in the VM object code that I do not understand fully here, so this
> might be the case.)

I saw those functions, vm_map_copy_entry() and vm_object_collapse(), but I didn't have enough understanding of the whole system to tell whether they might do other things that we don't want. I'll read them again after this e-mail.

>>> Next, how do you undo the damage done by your 'COW'?
>
>> This is one thing that we've thought about, but we don't have a
>> solution yet. I agree it is very important, though. I figured it
>> might be possible to 'unmark' the memory object as COW with some
>> additional tricks.
>
> I think you may be better off doing actual vm_map_copy_entry()
> calls.
>
> I am assuming, here, that snapshot saving is implemented by
> sending a request to the running bhyve, which spins off a thread
> or process that does the snapshot save. If you spin it off as
> a real process, i.e., do a fork(), you will get the existing
> VM system to do all the work for you.
> The overall strategy then looks something like this:
>
>     handle_external_suspend_or_snapshot_request() {
>         set global suspending flag /* if needed */
>         stop all vcpus
>         signal virtio and emulated devices to quiesce, if needed
>         if (snapshot) {
>             open snapshot file
>             pid = fork()
>             if (pid == 0) { /* child */
>                 COW is now in effect on memory: save more-volatile
>                   vcpu and dev state
>                 pthread_cond_signal parent that it's safe to resume
>                 save RAM state
>                 close snapshot file
>                 _exit(0)
>             }
>             if (pid < 0) ... handle error ...
>             /* parent */
>             close snapshot file
>             wait for child to signal OK to resume
>         } else {
>             wait for external resume signal
>         }
>         clear suspending flag
>         resume devices and vcpus
>     }
>
> To resume a snapshot from a file, we load its state and then run
> the last two steps (clear suspending flag and resume devices and
> vcpus).
>
> This way all the COW action happens through fork(), so there is no
> new kernel-side code required.

This looks perfect to me; this was one of my first questions when I joined. However, I am not sure whether it's OK to fork the entire bhyve memory space; I remember seeing some discussion about this, which is why I CCed Peter. Right now we have a checkpoint thread that listens for the checkpoint signal (via a UNIX socket), then locks the vCPUs, saves some state, requests a COW mapping (via an ioctl), unlocks the vCPUs, and copies the COW memory to a checkpoint file. I haven't done anything about unmapping the COW entry yet.

> (Frankly, I think the hard part here is saving device and virtual
> APIC state. If you have the vlapic state saving working, you have
> made pretty good progress.)

Thanks. I am almost sure it is not complete yet, but I have vlapic state saving working. In fact, I am able to restore VMs that use a ramdisk and no devices except the console. I'd like to open a pull request for review as soon as possible; in the meantime I've started looking at the virtio devices, to save/restore virtio-net as well.

-- 
Flavius