From: Flavius Anton <f.v.anton@gmail.com>
Date: Wed, 12 Apr 2017 14:11:30 +0300
Subject: Re: On COW memory mapping in d_mmap_single
To: Chris Torek, freebsd-hackers@freebsd.org
Cc: Peter Grehan
List-Id: Technical Discussions relating to FreeBSD

Hi Chris,

Thanks a lot for your answer. I've added Peter to CC, as he knows about this ongoing project; some of the design decisions, such as the COW mapping, had already been made to some extent when I joined. Please see my inline answers below.

On Wed, Apr 12, 2017 at 1:10 AM, Chris Torek wrote:

>> Yes, all vCPUs are locked before calling mmap(). I agree that we don't
>> need COW as long as we keep all vCPUs locked while we copy the entire
>> VM memory. But that might take a while: imagine a VM with 32 GB or
>> more of RAM. Writing that to disk could take minutes, and we don't
>> want the VM to be frozen for that long. That's the reason we'd like to
>> map the memory COW and then unlock the vCPUs.
>
> You'll need to save the device state while holding the CPUs locked,
> too, so that the virtio queues can be in sync when you restore.

Yes, saving the vCPU state, vlapic, ioapic, etc. is done with all vCPUs locked. Memory, on the other hand, may be too large and take too much time to copy. I am working right now on saving the virtio queues and device state.
>> It's an OBJT_DEFAULT object. It's not a device object; it's the memory
>> object given to the guest to use as physical memory.
>
> Your copy code path is basically a simplified vm_map_copy_entry()
> as called from vmspace_fork() for the MAP_INHERIT case. But if
> these are OBJT_DEFAULT, shouldn't you be calling vm_object_collapse()?
> See https://github.com/flaviusanton/freebsd/blob/bhyve-save-restore/sys/vm/vm_map.c#L3170
> (Maybe src_object->handle is never NULL? There are several things
> in the VM object code that I do not understand fully here, so this
> might be the case.)

I saw those functions, vm_map_copy_entry() and vm_object_collapse(), but I didn't have enough understanding of the whole system to tell whether they might do other things that we don't want. I'll read them again after this e-mail.

>>> Next, how do you undo the damage done by your 'COW'?
>
>> This is one thing that we've thought about, but we don't have a
>> solution yet. I agree it is very important, though. I figured it
>> might be possible to 'unmark' the memory object as COW with some
>> additional tricks.
>
> I think you may be better off doing actual vm_map_copy_entry()
> calls.
>
> I am assuming, here, that snapshot saving is implemented by
> sending a request to the running bhyve, which spins off a thread
> or process that does the snapshot save. If you spin it off as
> a real process, i.e., do a fork(), you will get the existing
> VM system to do all the work for you.
> The overall strategy then looks something like this:
>
>     handle_external_suspend_or_snapshot_request() {
>         set global suspending flag /* if needed */
>         stop all vcpus
>         signal virtio and emulated devices to quiesce, if needed
>         if (snapshot) {
>             open snapshot file
>             pid = fork()
>             if (pid == 0) { /* child */
>                 COW is now in effect on memory: save more-volatile
>                   vcpu and dev state
>                 pthread_cond_signal parent that it's safe to resume
>                 save RAM state
>                 close snapshot file
>                 _exit(0)
>             }
>             if (pid < 0) ... handle error ...
>             /* parent */
>             close snapshot file
>             wait for child to signal OK to resume
>         } else {
>             wait for external resume signal
>         }
>         clear suspending flag
>         resume devices and vcpus
>     }
>
> To resume a snapshot from a file, we load its state and then run
> the last two steps (clear suspending flag and resume devices and
> vcpus).
>
> This way all the COW action happens through fork(), so there is no
> new kernel-side code required.

This looks perfect to me; this was one of my first questions when I joined. However, I am not sure whether it's OK to fork the entire bhyve memory space; I remember seeing some discussion about this, which is why I CCed Peter. Right now we have a checkpoint thread that listens for the checkpoint signal (via a UNIX socket), then locks the vCPUs, saves some state, requests a COW mapping (via an ioctl), unlocks the vCPUs, and copies the COW memory to a checkpoint file. I haven't done anything about unmapping the COW entry yet.

> (Frankly, I think the hard part here is saving device and virtual
> APIC state. If you have the vlapic state saving working, you have
> made pretty good progress.)

Thanks. I am almost sure it is not complete yet, but I have vlapic state saving working. In fact, I am able to restore VMs that use a ramdisk and no devices except the console. I'd like to open a pull request for review as soon as possible; in the meantime I've started looking at the virtio devices, to save/restore virtio-net as well.

-- 
Flavius