From owner-freebsd-hackers Thu Oct 26 08:03:28 1995
Return-Path: owner-hackers
Received: (from root@localhost) by freefall.freebsd.org (8.6.12/8.6.6) id IAA08990 for hackers-outgoing; Thu, 26 Oct 1995 08:03:28 -0700
Received: from terra.Sarnoff.COM (terra.sarnoff.com [130.33.11.203]) by freefall.freebsd.org (8.6.12/8.6.6) with ESMTP id IAA08984 for ; Thu, 26 Oct 1995 08:03:16 -0700
Received: (from rminnich@localhost) by terra.Sarnoff.COM (8.6.12/8.6.12) id LAA08336; Thu, 26 Oct 1995 11:01:38 -0400
Date: Thu, 26 Oct 1995 11:01:37 -0400 (EDT)
From: "Ron G. Minnich"
To: hackers@freebsd.org, Chuck Cranor, Theo de Raadt
Subject: msync is basically busted still (2.1 snap)
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-hackers@freebsd.org
Precedence: bulk

Well, I've been fighting msync in 2.0.5 for a while now, trying to get
proper behaviour, and it doesn't work. A quick walk through 2.1 shows the
same set of problems.

Description: given the following code (a sketch; the mmap arguments are
filled in here for clarity):

    main() {
            int f = open("/some_nfs_file", 2);      /* O_RDWR */
            char *c = mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, f, 0);
            char x = *c;                    /* fault the page in: nfs read */
            *c = x;                         /* dirty the page */
            msync(c, sizeof(*c), MS_INVALIDATE);
            x = *c;                         /* touch it again */
    }

I should see an nfs read (for the first *c), an nfs write (for the msync,
since the page is dirty), and another nfs read (for the last *c). What I
see is one nfs read. That's it. The write gets swallowed up by the kernel,
and even if I force the write to happen by hacking the code, the second
read won't happen anyway.

So let's look at msync:

    msync(p, uap, retval)
            struct proc *p;
            struct msync_args *uap;
            int *retval;

msync just calls:

    /*
     * Clean the pages and interpret the return value.
     */
    rv = vm_map_clean(map, addr, addr + size, (flags & MS_ASYNC) == 0,
        (flags & MS_INVALIDATE) != 0);

Note that the last argument is the invalidate flag. After contortions in
vm_map_clean we see:

    if (current->protection & VM_PROT_WRITE)
            vm_object_page_clean(object, offset, offset + size, syncio);

Note that the invalidate flag has just gotten lost. Shouldn't matter,
right? Wrong. The underlying objects need to know about the invalidate in
case they cache. They need a flag that says: blow this data out of your
buffer cache. Or, in recognition of VM caching, we need to decide whether
buffer caches still make sense at all. So this is the first problem:

1) we should pass the invalidate information down to the underlying
   objects (a sketch of what I mean is further down).

Back to our story. vm_object_page_clean does (for vnodes):

    VOP_LOCK(vp);
    _vm_object_page_clean(object, start, end, syncio);
    VOP_UNLOCK(vp);

and _vm_object_page_clean does (after much work):

    tincr = vm_pageout_clean(p, VM_PAGEOUT_FORCE);

A minor problem is that we're not really doing more than single-page I/O
at this point, which is going to require us (further down) to be smart and
try to cluster the I/O again. We've tossed information away at the higher
levels, and we're going to do work at the lower levels to recreate it.
This translates to "we'll be slow". Think of this when you run lmbench and
get beaten by Linux :-(

vm_pageout_clean does another very large blob of work, including trying to
group some pages, but finally:

    if (pageout_count == 1) {
            pageout_status[0] = pager ?
                vm_pager_put(pager, m,
                    ((sync || (object == kernel_object)) ? TRUE :
    etc.

At some point we get to:

    /*
     * generic vnode pager output routine
     */
    int
    vnode_pager_output(vnp, m, count, rtvals)

If all has gone well we have a list of pages in m to output, and a count.
(Any statistics on how often the count is > 1?) We no longer have any
knowledge of whether the file system should try to cache the data, whether
the I/O is synchronous, and so on. It's gone. This is the Mach way of
doing things: layers with information hiding.
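Before going further: just to make (1) above concrete, here is roughly the
shape of what I mean. This is a sketch only -- I haven't tried it, I'm
guessing at the exact vm_object_page_remove() call, and it does nothing
about the NFS buffer cache, which is the harder half of the problem:

    /* in vm_map_clean(), sketch: use the invalidate flag instead of
     * dropping it on the floor */
    if (current->protection & VM_PROT_WRITE)
            vm_object_page_clean(object, offset, offset + size, syncio);
    if (invalidate)
            /* toss the resident pages so the next touch faults and
             * re-reads the file */
            vm_object_page_remove(object, offset, offset + size);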
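And to make the information-hiding complaint concrete: by the time we
reach vnode_pager_output, all that survives is the pager, the page list,
and a count. Something like the following -- purely hypothetical, the
flags argument and its meanings are invented, this is not a patch -- is
the kind of interface that would let the original msync intent reach the
file system:

    /* hypothetical: an extra argument carrying the caller's intent */
    int
    vnode_pager_output(vnp, m, count, rtvals, flags)
            /* flags: synchronous? invalidate afterwards?
             * worth caching in the buffer cache at all? */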
I think it's busted: I prefer systems that add more and more information
as you go down the layers, so that less recomputation has to be done.
There's an amazing amount of redundant computation in the msync path so
far, some of it because information gets lost on the way down. (That said,
the Mach VM is certainly nicer than what preceded it ...) But we're never
going to go fast this way, and going fast is what we want.

Anyway, from vnode_pager_output:

    aiov.iov_base = (caddr_t) 0;
    aiov.iov_len = maxsize;
    auio.uio_iov = &aiov;
    auio.uio_iovcnt = 1;
    auio.uio_offset = m[0]->offset;
    auio.uio_segflg = UIO_NOCOPY;
    auio.uio_rw = UIO_WRITE;
    auio.uio_resid = maxsize;
    auio.uio_procp = (struct proc *) 0;
    error = VOP_WRITE(vp, &auio, IO_VMIO, curproc->p_ucred);

Finally! I/O, right? Maybe ... this goes to nfs_write, where we see
something else weird:

    if (ioflag & (IO_APPEND | IO_SYNC)) {
            if (np->n_flag & NMODIFIED) {
                    np->n_attrstamp = 0;
                    error = nfs_vinvalbuf(vp, V_SAVE, cred, p, 1);
                    if (error)
                            return (error);
            }

Ouch! Append or sync I/O means invalidating ALL buffered data? Even in the
VM cache? Yup, even in the VM cache: nfs_vinvalbuf calls vinvalbuf, which
will try to clean the buffered pages. It is possible (I exercised this
path yesterday while fooling around) to have msync call into the VM
system, which calls nfs_write, which in turn calls back into the VM
system, and you deadlock, because the msync already has things locked.
Oooch, ouch. (The call chain is spelled out in the P.S. below.)

In the case of a standard msync, though, IO_APPEND and IO_SYNC are not
set, so we keep going and see:

    /*
     * If the new write will leave a contiguous dirty
     * area, just update the b_dirtyoff and b_dirtyend,
     * otherwise force a write rpc of the old dirty area.
     */

OUCH! So msync has no way of ensuring that the I/O will actually happen!
That's why msync won't work for me ... Of course there's another path
below this test:

    /*
     * If the lease is non-cachable or IO_SYNC do bwrite().
     */
    if ((np->n_flag & NQNFSNONCACHE) || (ioflag & IO_SYNC)) {
            bp->b_proc = p;
            error = VOP_BWRITE(bp);
            if (error)
                    return (error);

But we can't get here with IO_SYNC set: we already tested for it and took
a different path at the top of nfs_write. On top of that, the cache is not
getting invalidated, and I have not found a reasonable way to do it that
doesn't pull in the VM system and cause a deadlock.

Now what? msync still won't work in 2.1 ... but the bigger problem is that
the VM system overall seems overloaded with work.

ron
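P.S. The deadlock path I hit, spelled out from my notes -- treat the exact
frames as approximate, the point is just that msync re-enters the VM from
underneath itself:

    msync()
      vm_map_clean()
        vm_object_page_clean()
          vm_pageout_clean()
            vnode_pager_output()
              VOP_WRITE() -> nfs_write()        (IO_APPEND or IO_SYNC set)
                nfs_vinvalbuf()
                  vinvalbuf()                   back into the VM to clean
                                                buffered pages that the
                                                msync above already has
                                                locked: deadlock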