From: Flavius Anton
Date: Tue, 11 Apr 2017 16:55:00 +0300
Subject: Re: On COW memory mapping in d_mmap_single
To: freebsd-hackers@freebsd.org
List-Id: Technical Discussions relating to FreeBSD

>On Tue, Apr 11, 2017 at 04:00:21PM +0300, Konstantin Belousov wrote:
>>On Tue, Apr 11, 2017 at 03:37:26PM +0300, Flavius Anton wrote:
>> Hi everyone,
>>
>> I'll start by giving some context, so you can better understand what
>> is the problem I'm trying to solve. I've been working for a while on
>> bhyve trying to implement save/restore [1]. We've currently managed to
>> get it working for VMs using a ramdisk and no devices, so just vCPU
>> and memory states are saved and restored so far.
>>
>> Last week I started looking into network devices, specifically
>> virtio-net devices. The problem was that when I issue a checkpoint
>> operation, the guest virtio driver stops working. After digging for a
>> while, I figured out the problem is marking VM memory as COW. If I
>> don't do this, the driver continues with no problem after
>> checkpointing.
>>
>> Each VM has an associated vmspace and a /dev/vmm/VM_NAME device. When
>> the user space does a mmap on the /dev device, we would like to mark
>> VM memory as COW, thus the VM can continue touching pages while the
>> user space is writing the 'freezed', COW marked memory to a persistent
>> storage.
>> We do this by iterating through all vm_entries from VM's
>> vmspace, we find which entry is mapping the object that has VM memory
>> and then we roughly just set MAP_ENTRY_COW and MAP_ENTRY_NEEDS_COPY on
>> that entry. You can see the code here [2].
>
>This is very strange operation, to put it mildly. First, are other vCPUs
>operate while you do your 'COW' ? If yes, you are guaranteed to get
>inconsistent snapshot. If not, then you do not need 'COW'.

Yes, all vCPUs are locked before calling mmap(). I agree that we don't
need 'COW' as long as we keep all vCPUs locked while we copy the
entire VM memory. But that copy might take a while: imagine a VM with
32GB or more of RAM. Writing that to disk could take minutes, and we
don't want the VM to be frozen for that long. That's the reason we'd
like to map the memory COW and then unlock the vCPUs.

>More, what kinds of VM objects are mapped into the vmspace ? FreeBSD VM
>does not support shadowing of device objects (which means, inserting
>shadow objects into the device object chain breaks VM invariants). One
>of the main reasons why it not needed to be supported is because shadow
>copy cannot see changes which are performed on the shadowed pages,
>supposedly done by device. If vmm mmaps some devices into guest vmspace,
>the devices would kind of 'freeze' from the guest PoV.

It's an OBJT_DEFAULT object. It's not a device object; it's the memory
object given to the guest to use as physical memory.

>Next, how do you undo the damage done by your 'COW' ?

This is one thing that we've thought about, but we don't have a
solution for now. I agree it is very important, though. I figured that
it might be possible to 'unmark' the memory object as COW with some
additional tricks.

>> I'm not sure if the above is sufficient for our purpose. In other
>> words, how would you do this? You have a vm_object that is referenced
>> via a vm_entry by process A (the user space).
>> Somebody else, process B
>> let's say, does an mmap() on your device and you'd like to freeze that
>> object, such that process B can see a consistent snapshot of it, while
>> you want process A to be able to continue reading and writing from/to
>> it.
>This is not supported. I have no idea why would a copy of a page which
>reflects the device state even considered as a good idea. But you cannot
>make the consistent copy without device cooperation anyway, since device
>might modify its state while CPU reads.

I'm sorry if I haven't been clear enough. The object that I'm trying
to map as COW is not a device object. It's just the object that
contains VM memory. That object shouldn't change while all of the VM's
vCPUs are locked, and I make sure they are locked when calling mmap().

Thanks for your input on this.

--
Flavius

>> I've also read through Design Elements of the FreeBSD VM system [3],
>> but I am still afraid (I am sure) that I have some misunderstandings.
>>
>> Thank you very much for bearing with me and going through this wall of text.
>>
>> [1] https://github.com/flaviusanton/freebsd/tree/bhyve-save-restore
>> [2] https://github.com/flaviusanton/freebsd/blob/bhyve-save-restore/sys/amd64/vmm/vmm_dev.c#L862
>> [3] https://www.freebsd.org/doc/en/articles/vm-design/index.html
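
To make the snapshot semantics discussed in this thread concrete, here is a
minimal userland sketch. It is only an analogy and not the vmm code from [2]:
it assumes that fork() stands in for the "freeze", that the parent plays the
role of the running guest, and that the child plays the role of the process
writing the checkpoint. The name cow_snapshot_demo and the GUEST_MEM_SIZE
constant are hypothetical and exist only for this illustration.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for guest RAM; a real guest has gigabytes. */
#define GUEST_MEM_SIZE (4 * 4096)

/*
 * Userland analogy only: after fork() the child shares the parent's pages
 * copy-on-write. The parent (the "guest") keeps writing, while the child
 * (the "checkpoint writer") still sees the contents as of the fork.
 * Returns 0 if the child saw the frozen contents, non-zero otherwise.
 */
static int
cow_snapshot_demo(void)
{
	unsigned char *mem;
	int go[2], status;
	pid_t pid;
	char token;

	mem = malloc(GUEST_MEM_SIZE);
	if (mem == NULL)
		return (-1);
	memset(mem, 0xAA, GUEST_MEM_SIZE);	/* state at checkpoint time */

	if (pipe(go) != 0) {
		free(mem);
		return (-1);
	}

	pid = fork();				/* the "freeze" point */
	if (pid < 0) {
		free(mem);
		return (-1);
	}
	if (pid == 0) {
		/* Child: wait until the parent has dirtied every page. */
		close(go[1]);
		if (read(go[0], &token, 1) != 1)
			_exit(2);
		for (size_t i = 0; i < GUEST_MEM_SIZE; i++)
			if (mem[i] != 0xAA)
				_exit(1);	/* snapshot was not stable */
		_exit(0);
	}

	/* Parent: keep running and overwrite memory after the freeze. */
	close(go[0]);
	memset(mem, 0x55, GUEST_MEM_SIZE);
	if (write(go[1], "x", 1) != 1) {
		free(mem);
		return (-1);
	}
	close(go[1]);

	if (waitpid(pid, &status, 0) != pid) {
		free(mem);
		return (-1);
	}
	free(mem);
	return (WIFEXITED(status) ? WEXITSTATUS(status) : -1);
}
```

Calling cow_snapshot_demo() from main() should return 0 on a system where
fork() shares pages copy-on-write. The sketch says nothing about whether
setting MAP_ENTRY_COW and MAP_ENTRY_NEEDS_COPY on an existing vm_map entry is
safe in the kernel, which is the open question in this thread.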