From owner-freebsd-current  Sun Dec 14 09:16:49 1997
Return-Path: <owner-freebsd-current>
Received: (from root@localhost)
          by hub.freebsd.org (8.8.7/8.8.7) id JAA20208
          for current-outgoing; Sun, 14 Dec 1997 09:16:49 -0800 (PST)
          (envelope-from owner-freebsd-current)
Received: from skynet.ctr.columbia.edu (skynet.ctr.columbia.edu [128.59.64.70])
          by hub.freebsd.org (8.8.7/8.8.7) with SMTP id JAA20195;
          Sun, 14 Dec 1997 09:16:34 -0800 (PST)
          (envelope-from wpaul@skynet.ctr.columbia.edu)
Received: (from wpaul@localhost) by skynet.ctr.columbia.edu (8.6.12/8.6.9) id MAA13140; Sun, 14 Dec 1997 12:17:38 -0500
From: Bill Paul <wpaul@skynet.ctr.columbia.edu>
Message-Id: <199712141717.MAA13140@skynet.ctr.columbia.edu>
Subject: Re: mmap() + NFS problems persist
To: dfr@nlsystems.com
Date: Sun, 14 Dec 1997 12:17:37 -0500 (EST)
Cc: current@freebsd.org, toor@dyson.iquest.net, dyson@freebsd.org
In-Reply-To: <Pine.BSF.3.95q.971213101402.514A-100000@herring.nlsystems.com> from "Doug Rabson" at Dec 13, 97 10:23:42 am
X-Mailer: ELM [version 2.4 PL24]
Content-Type: text
Sender: owner-freebsd-current@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

Of all the gin joints in all the towns in all the world, Doug Rabson had 
to walk into mine and say:

[...]
 
> I think I understand what might be happening.  I can't easily check since
> my FreeBSD hacking box is at work though.  What I think happens is that
> when brelse is called in this code fragment,
> 
> >                 if (not_readin && n > 0) {
> >                         if (on < bp->b_validoff ||
> >                             (on + n) > bp->b_validend) {
> >                                 bp->b_flags |= B_NOCACHE;
> >                                 bp->b_flags |= B_INVAFTERWRITE;
> >                                 if (bp->b_dirtyend > 0) {
> >                                     if ((bp->b_flags & B_DELWRI) == 0)
> >                                         panic("nfsbioread");
> >                                     if (VOP_BWRITE(bp) == EINTR)
> >                                         return (EINTR);
> >                                 } else
> >                                     brelse(bp);
> >                                 goto again;  <----- LOOPS HERE!!
> 
> the 8k buffer has exactly one VM page associated with it.

Err... with all due respect, that means it's really a 4K buffer, not
an 8K buffer, yes? If so, assuming the NFS layer did an 8K read, where
did the other 4K go?

> The NFS code is
> attempting to throw the buffer away since it is only partially valid and
> it wants to read from the invalid section of the buf.  It does this by
> setting the B_NOCACHE flag before calling brelse.  Unfortunately the
> underlying VM page is still valid, so when getblk is called, the code in
> allocbuf which tracks down the underlying VM pages carefully resets
> b_validoff and b_validend causing the loop.
> 
> Basically, the VMIO system has managed to ignore the NFS code's request
> for a cache flush, which the NFS code relied on to break the loop in
> nfs_bioread.  As I see it, the problem can be fixed in two ways.  The
> first would be for brelse() on a B_NOCACHE buffer to invalidate the VM
> pages in the buffer, restoring the old behaviour which NFS expected and
> the second would be to rewrite that section of the NFS client to cope
> differently with partially valid buffers.

Hmmm... I think I see the code in vfs_bio.c:brelse() that has led to
this, but the comments seem to indicate that reverting it would be a bug.

There are a couple of things I don't understand. You seem to indicate that
setting the B_NOCACHE flag will cause brelse() to flush the cache, but
I didn't know brelse() did that. Also, bear in mind that the first
4K block that's in core is now dirty, so having brelse() throw it away
would be wrong, unless it forced the first 4K block to be written out
first, but again I don't see where that happens (unless brelse() just
sets up the block to be written out and getblk() actually does it).
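
To pin down the loop as Doug describes it, here is a toy user-space model.
It is only a sketch under stated assumptions, not the kernel code: every
toy_* name is made up for illustration, the buffer is assumed to be 8K
backed by two 4K pages with only the first one valid, and dirty data
(b_dirtyend, VOP_BWRITE) is ignored entirely. The one thing it tries to
show is that if releasing a B_NOCACHE buffer leaves the backing page valid,
the next getblk rebuilds the same partial valid range and the retry never
makes progress.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: names mimic the kernel's, but none of this is kernel code. */
#define TOY_PAGE_SIZE 4096
#define TOY_BUF_SIZE  8192
#define TOY_NPAGES    (TOY_BUF_SIZE / TOY_PAGE_SIZE)

struct toy_buf {
    int validoff;   /* start of the valid byte range */
    int validend;   /* end of the valid byte range   */
};

/* Validity of the pages backing the buffer; outlives the buffer itself. */
static bool toy_page_valid[TOY_NPAGES];

/* Models getblk/allocbuf: rebuild the valid range from the backing pages. */
static void toy_getblk(struct toy_buf *bp)
{
    bp->validoff = 0;
    bp->validend = 0;
    for (int i = 0; i < TOY_NPAGES; i++) {
        if (!toy_page_valid[i])
            break;
        bp->validend = (i + 1) * TOY_PAGE_SIZE;
    }
}

/* Models brelse on a B_NOCACHE buffer; optionally invalidates the pages. */
static void toy_brelse_nocache(bool invalidate_pages)
{
    if (invalidate_pages) {
        for (int i = 0; i < TOY_NPAGES; i++)
            toy_page_valid[i] = false;
    }
}

/*
 * Models the retry loop in the quoted fragment. Returns the number of
 * passes needed to satisfy a read of n bytes at offset on, or -1 if the
 * loop is still spinning after max_tries passes.
 */
static int toy_bioread(int on, int n, bool invalidate_pages, int max_tries)
{
    assert(max_tries > 0);

    /* Assumed starting state: only the first 4K page is valid. */
    toy_page_valid[0] = true;
    toy_page_valid[1] = false;

    for (int tries = 1; tries <= max_tries; tries++) {
        struct toy_buf b;

        toy_getblk(&b);
        bool not_readin = (b.validend > 0);   /* found cached data */

        if (!not_readin) {
            /* Nothing cached: pretend to read the whole block. */
            for (int i = 0; i < TOY_NPAGES; i++)
                toy_page_valid[i] = true;
            b.validoff = 0;
            b.validend = TOY_BUF_SIZE;
        }

        if (not_readin && n > 0 &&
            (on < b.validoff || on + n > b.validend)) {
            toy_brelse_nocache(invalidate_pages);
            continue;                         /* the "goto again" */
        }
        return tries;
    }
    return -1;
}
```

In this model, toy_bioread(4096, 100, false, 10) gives up after ten passes,
while the same call with invalidate_pages set to true gets through on the
second pass. Whether throwing the page away is actually safe when it is
dirty is exactly what the model leaves out.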

What is the correct action here? Should the dirty page be written out
first, then the buffer invalidated and the next 4K page read in? Or
should we write the dirty page but keep the buffer around and load the
next 4K page into another buffer? Or should both pages be combined into
a single 8K block? Should we not even bother to write the dirty page
out yet and just make sure the next 4K block is loaded correctly?

(And is this stuff related to the other problem where the process can
become stuck sleeping on 'vmopar'?)

I think I'm going to take a trip back to campus today so I can experiment
a bit more with the test box. (It's not like I have anything else to do
today.)

-Bill

-- 
=============================================================================
-Bill Paul            (212) 854-6020 | System Manager, Master of Unix-Fu
Work:         wpaul@ctr.columbia.edu | Center for Telecommunications Research
Home:  wpaul@skynet.ctr.columbia.edu | Columbia University, New York City
=============================================================================
 "It is not I who am crazy; it is I who am mad!" - Ren Hoek, "Space Madness"
=============================================================================