From owner-freebsd-hackers Thu Oct 26 08:03:28 1995
Return-Path: owner-hackers
Received: (from root@localhost) by freefall.freebsd.org (8.6.12/8.6.6) id IAA08990 for hackers-outgoing; Thu, 26 Oct 1995 08:03:28 -0700
Received: from terra.Sarnoff.COM (terra.sarnoff.com [130.33.11.203]) by freefall.freebsd.org (8.6.12/8.6.6) with ESMTP id IAA08984 for ; Thu, 26 Oct 1995 08:03:16 -0700
Received: (from rminnich@localhost) by terra.Sarnoff.COM (8.6.12/8.6.12) id LAA08336; Thu, 26 Oct 1995 11:01:38 -0400
Date: Thu, 26 Oct 1995 11:01:37 -0400 (EDT)
From: "Ron G. Minnich"
To: hackers@freebsd.org, Chuck Cranor, Theo de Raadt
Subject: msync is basically busted still (2.1 snap)
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-hackers@freebsd.org
Precedence: bulk

Well, I've been fighting msync in 2.0.5 for a while now, trying to get
proper behaviour, and it doesn't work. A quick walk through 2.1 shows the
same set of problems.

Description: given the following code (a sketch; the mmap arguments are
filled in here for clarity):

    main() {
            int f = open("/some_nfs_file", 2);      /* O_RDWR */
            char *c = mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, f, 0);
            char x = *c;                    /* fault the page in: nfs read */
            *c = x;                         /* dirty the page */
            msync(c, sizeof(*c), MS_INVALIDATE);
            x = *c;                         /* touch it again */
    }

I should see an nfs read (for the first *c), an nfs write (for the msync,
since the page is dirty), and another nfs read (for the last *c). What I
see is one nfs read. That's it. The write gets swallowed up by the kernel,
and even if I force the write to happen by hacking the code, the second
read won't happen anyway.

So let's look at msync:

    msync(p, uap, retval)
            struct proc *p;
            struct msync_args *uap;
            int *retval;

msync just calls:

    /*
     * Clean the pages and interpret the return value.
     */
    rv = vm_map_clean(map, addr, addr + size, (flags & MS_ASYNC) == 0,
        (flags & MS_INVALIDATE) != 0);

Note that the last argument is the invalidate flag. After contortions in
vm_map_clean we see:

    if (current->protection & VM_PROT_WRITE)
            vm_object_page_clean(object, offset, offset + size, syncio);

Note that the invalidate flag has just gotten lost. Shouldn't matter,
right? Wrong. The underlying objects need to know about the invalidate in
case they cache. They need a flag that says: blow this data out of your
buffer cache. Or, in recognition of VM caching, we need to decide whether
buffer caches still make sense at all. So this is the first problem:

1) we should pass the invalidate information down to the underlying
   objects (a sketch of what I mean is further down).

Back to our story. vm_object_page_clean does (for vnodes):

    VOP_LOCK(vp);
    _vm_object_page_clean(object, start, end, syncio);
    VOP_UNLOCK(vp);

and _vm_object_page_clean does (after much work):

    tincr = vm_pageout_clean(p, VM_PAGEOUT_FORCE);

A minor problem is that we're not really doing more than single-page I/O
at this point, which is going to require us (further down) to be smart and
try to cluster the I/O again. We've tossed information away at the higher
levels, and we're going to do work at the lower levels to recreate it.
This translates to "we'll be slow". Think of this when you run lmbench and
get beaten by Linux :-(

vm_pageout_clean does another very large blob of work, including trying to
group some pages, but finally:

    if (pageout_count == 1) {
            pageout_status[0] = pager ?
                vm_pager_put(pager, m,
                    ((sync || (object == kernel_object)) ? TRUE :
    etc.

At some point we get to:

    /*
     * generic vnode pager output routine
     */
    int
    vnode_pager_output(vnp, m, count, rtvals)

If all has gone well we have a list of pages in m to output, and a count.
(Any statistics on how often the count is > 1?) We no longer have any
knowledge of whether the file system should try to cache the data, whether
the I/O is synchronous, and so on. It's gone. This is the Mach way of
doing things: layers with information hiding.
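Before going further: just to make (1) above concrete, here is roughly the
shape of what I mean. This is a sketch only -- I haven't tried it, I'm
guessing at the exact vm_object_page_remove() call, and it does nothing
about the NFS buffer cache, which is the harder half of the problem:

    /* in vm_map_clean(), sketch: use the invalidate flag instead of
     * dropping it on the floor */
    if (current->protection & VM_PROT_WRITE)
            vm_object_page_clean(object, offset, offset + size, syncio);
    if (invalidate)
            /* toss the resident pages so the next touch faults and
             * re-reads the file */
            vm_object_page_remove(object, offset, offset + size);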
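And to make the information-hiding complaint concrete: by the time we
reach vnode_pager_output, all that survives is the pager, the page list,
and a count. Something like the following -- purely hypothetical, the
flags argument and its meanings are invented, this is not a patch -- is
the kind of interface that would let the original msync intent reach the
file system:

    /* hypothetical: an extra argument carrying the caller's intent */
    int
    vnode_pager_output(vnp, m, count, rtvals, flags)
            /* flags: synchronous? invalidate afterwards?
             * worth caching in the buffer cache at all? */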
I think it's busted: I prefer systems that add more and more information
as you go down the layers, so that less recomputation has to be done.
There's an amazing amount of redundant computation in the msync path so
far, some of it because information gets lost on the way down. (That said,
the Mach VM is certainly nicer than what preceded it ...) But we're never
going to go fast this way, and going fast is what we want.

Anyway, from vnode_pager_output:

    aiov.iov_base = (caddr_t) 0;
    aiov.iov_len = maxsize;
    auio.uio_iov = &aiov;
    auio.uio_iovcnt = 1;
    auio.uio_offset = m[0]->offset;
    auio.uio_segflg = UIO_NOCOPY;
    auio.uio_rw = UIO_WRITE;
    auio.uio_resid = maxsize;
    auio.uio_procp = (struct proc *) 0;
    error = VOP_WRITE(vp, &auio, IO_VMIO, curproc->p_ucred);

Finally! I/O, right? Maybe ... this goes to nfs_write, where we see
something else weird:

    if (ioflag & (IO_APPEND | IO_SYNC)) {
            if (np->n_flag & NMODIFIED) {
                    np->n_attrstamp = 0;
                    error = nfs_vinvalbuf(vp, V_SAVE, cred, p, 1);
                    if (error)
                            return (error);
            }

Ouch! Append or sync I/O means invalidating ALL buffered data? Even in the
VM cache? Yup, even in the VM cache: nfs_vinvalbuf calls vinvalbuf, which
will try to clean the buffered pages. It is possible (I exercised this
path yesterday while fooling around) to have msync call into the VM
system, which calls nfs_write, which in turn calls back into the VM
system, and you deadlock, because the msync already has things locked.
Oooch, ouch. (The call chain is spelled out in the P.S. below.)

In the case of a standard msync, though, IO_APPEND and IO_SYNC are not
set, so we keep going and see:

    /*
     * If the new write will leave a contiguous dirty
     * area, just update the b_dirtyoff and b_dirtyend,
     * otherwise force a write rpc of the old dirty area.
     */

OUCH! So msync has no way of ensuring that the I/O will actually happen!
That's why msync won't work for me ... Of course there's another path
below this test:

    /*
     * If the lease is non-cachable or IO_SYNC do bwrite().
     */
    if ((np->n_flag & NQNFSNONCACHE) || (ioflag & IO_SYNC)) {
            bp->b_proc = p;
            error = VOP_BWRITE(bp);
            if (error)
                    return (error);

But we can't get here with IO_SYNC set: we already tested for it and took
a different path at the top of nfs_write. On top of that, the cache is not
getting invalidated, and I have not found a reasonable way to do it that
doesn't pull in the VM system and cause a deadlock.

Now what? msync still won't work in 2.1 ... but the bigger problem is that
the VM system overall seems overloaded with work.

ron
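P.S. The deadlock path I hit, spelled out from my notes -- treat the exact
frames as approximate, the point is just that msync re-enters the VM from
underneath itself:

    msync()
      vm_map_clean()
        vm_object_page_clean()
          vm_pageout_clean()
            vnode_pager_output()
              VOP_WRITE() -> nfs_write()        (IO_APPEND or IO_SYNC set)
                nfs_vinvalbuf()
                  vinvalbuf()                   back into the VM to clean
                                                buffered pages that the
                                                msync above already has
                                                locked: deadlock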