Date: Tue, 11 Apr 2017 15:10:31 -0700 (PDT)
From: Chris Torek
To: f.v.anton@gmail.com, freebsd-hackers@freebsd.org
Subject: Re: On COW memory mapping in d_mmap_single

>Yes, all vCPUs are locked before calling mmap(). I agree that we don't
>need 'COW', as long as we keep all vCPUs locked while we copy the
>entire VM memory. But this might take a while, imagine a VM with 32GB
>or more of RAM.
>This will take maybe minutes to write to disk, so we
>don't actually want the VM to be frozen for so long. That's the
>reason we'd like to map the memory COW and then unlock vCPUs.

You'll need to save the device state while holding the CPUs locked,
too, so that the virtio queues can be in sync when you restore.

>It's an OBJT_DEFAULT. It's not a device object, it's the memory object
>given to guest to use as physical memory.

Your copy code path is basically a simplified vm_map_copy_entry()
as called from vmspace_fork() for the VM_INHERIT_COPY case.  But if
these are OBJT_DEFAULT, shouldn't you be calling
vm_object_collapse()?  See

https://github.com/flaviusanton/freebsd/blob/bhyve-save-restore/sys/vm/vm_map.c#L3170

(Maybe src_object->handle is never NULL?  There are several things
in the VM object code that I do not understand fully here, so this
might be the case.)

>>Next, how do you undo the damage done by your 'COW' ?

>This is one thing that we've thought about, but we don't have a
>solution for now. I agree it is very important, though. I figured that
>it might be possible to 'unmark' the memory object as COW with some
>additional tricks.

I think you may be better off doing actual vm_map_copy_entry()
calls.

I am assuming, here, that snapshot-saving is implemented by sending
a request to the running bhyve, which spins off a thread or process
that does the snapshot-save.  If you spin it off as a real process,
i.e., do a fork(), you will get the existing VM system to do all
the work for you.  The overall strategy then looks something like
this:

    handle_external_suspend_or_snapshot_request()
    {
        set global suspending flag /* if needed */
        stop all vcpus
        signal virtio and emulated devices to quiesce, if needed
        if (snapshot) {
            open snapshot file
            pid = fork()
            if (pid == 0) {
                /* child */
                COW is now in effect on memory: save more-volatile
                    vcpu and dev state
                pthread_cond_signal parent that it's safe to resume
                save RAM state
                close snapshot file
                _exit(0)
            }
            if (pid < 0)
                ... handle error ...
            /* parent */
            close snapshot file
            wait for child to signal OK to resume
        } else {
            wait for external resume signal
        }
        clear suspending flag
        resume devices and vcpus
    }

To resume a snapshot from a file, we load its state and then run
the last two steps (clear suspending flag and resume devices and
vcpus).

This way all the COW action happens through fork(), so there is no
new kernel-side code required.

(Frankly, I think the hard part here is saving device and virtual
APIC state.  If you have the vlapic state saving working, you have
made pretty good progress.)

Chris
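
For illustration only, here is a minimal user-space sketch of the
fork() strategy outlined above.  It is not bhyve code, and it makes
several assumptions: a heap buffer stands in for guest RAM, the vcpu
and device state step is left as a comment, and the names
snapshot_via_fork(), snapshot_demo(), and the /tmp path are all
hypothetical.  It uses a pipe for the "safe to resume" notification
instead of pthread_cond_signal, since a condition variable is only
usable across fork() if it is process-shared and lives in shared
memory.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: write "ram" to "path" from a forked child.
 * Returns the child's pid once the child says it is safe to resume,
 * or -1 on error.  After it returns, the parent may modify ram; the
 * child still sees the contents as of the fork() (copy-on-write).
 */
static pid_t
snapshot_via_fork(const unsigned char *ram, size_t len, const char *path)
{
    int fd, p[2];
    pid_t pid;
    char ok = 1;

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return (-1);
    if (pipe(p) < 0) {
        close(fd);
        return (-1);
    }
    pid = fork();
    if (pid == 0) {
        /* child: COW is now in effect on memory */
        size_t off = 0;

        close(p[0]);
        /* (more-volatile vcpu and device state would be saved here) */
        if (write(p[1], &ok, 1) != 1)   /* tell parent: safe to resume */
            _exit(1);
        close(p[1]);
        while (off < len) {             /* save RAM state */
            ssize_t n = write(fd, ram + off, len - off);
            if (n <= 0)
                _exit(1);
            off += (size_t)n;
        }
        close(fd);
        _exit(0);
    }
    /* parent, or fork() failure */
    close(fd);
    close(p[1]);
    if (pid < 0) {
        close(p[0]);
        return (-1);
    }
    /* wait for child to signal OK to resume */
    if (read(p[0], &ok, 1) != 1) {
        close(p[0]);
        waitpid(pid, NULL, 0);
        return (-1);
    }
    close(p[0]);
    return (pid);
}

/* Demo: snapshot, dirty memory in the parent, check the file is clean. */
static int
snapshot_demo(void)
{
    const char *path = "/tmp/snapshot_demo.bin";    /* hypothetical */
    size_t i, off = 0, len = 1 << 16;
    unsigned char *ram = malloc(len), *buf = malloc(len);
    int fd, status, rv = 0;
    pid_t pid;

    if (ram == NULL || buf == NULL)
        return (-1);
    memset(ram, 0xAA, len);
    pid = snapshot_via_fork(ram, len, path);
    if (pid < 0)
        return (-1);
    memset(ram, 0x55, len);     /* "guest" resumes and dirties RAM */
    if (waitpid(pid, &status, 0) != pid ||
        !WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return (-1);
    if ((fd = open(path, O_RDONLY)) < 0)
        return (-1);
    while (off < len) {
        ssize_t n = read(fd, buf + off, len - off);
        if (n <= 0)
            break;
        off += (size_t)n;
    }
    close(fd);
    unlink(path);
    if (off != len)
        rv = -1;
    for (i = 0; rv == 0 && i < len; i++)
        if (buf[i] != 0xAA)     /* must be the pre-fork contents */
            rv = -1;
    free(ram);
    free(buf);
    return (rv);
}
```

Calling snapshot_demo() should return 0: the file holds the pre-fork
bytes even though the parent overwrote the buffer while the child was
still writing.  In a multi-threaded process such as bhyve, only the
calling thread exists in the child after fork(), so the child here
sticks to async-signal-safe calls (write, close, _exit).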