Date: Sat, 27 Sep 1997 15:07:55 -0500
From: Karl Denninger <karl@mcs.net>
To: Poul-Henning Kamp <phk@critter.freebsd.dk>
Cc: Karl Denninger <karl@mcs.net>, Nate Williams <nate@mt.sri.com>, current@freebsd.org
Subject: Re: WARNING! Builds from the last few days have BROKEN NFS
Message-ID: <19970927150755.39452@Mars.Mcs.Net>
In-Reply-To: <13496.875390123@critter.freebsd.dk>; from Poul-Henning Kamp on Sat, Sep 27, 1997 at 09:55:23PM +0200
References: <19970927145131.64000@Mars.Mcs.Net> <13496.875390123@critter.freebsd.dk>
On Sat, Sep 27, 1997 at 09:55:23PM +0200, Poul-Henning Kamp wrote:
> >> 
> >> 'Just do it, and quit 'yer arguing' :) :) :) :)
> >
> >What I don't understand is why this kind of change would have the effect
> >that it is having.  That is, why would reducing the target number of vnodes
> >on the freelist lead to hangs in a disk wait ("D") which are unkillable?
> 
> Karl, neither do any of us at this time.
> 
> We may be pretty good wizards, but we're not gods any of us.
> 
> We need more data, and that involves you doing as we say in an attempt
> to figure out what the heck is going on.
> 
> I'm running a make buildworld across NFS right now, but so far I see no
> trouble along the lines of what you found.
> 
> --
> Poul-Henning Kamp             FreeBSD coreteam member
> phk@FreeBSD.ORG               "Real hackers run -current on their laptop."
Ok -- tomorrow evening I will check out another copy of the current sources,
and give this another shot with the parameters you're referring to.
The problem shows up only during fairly heavy I/O load -- my initial tests
didn't show it, but putting the code on a reasonably-busy web server does,
and it's easily reproduced in about 20-30 minutes.
Same with the shell systems here.  The symptom is that a single disk I/O
request will hang in a "D" state.  Further attempts to access that same
object then also hang, but others, even to the same disk pack, do NOT.
That is, a "df" still works, but a "cat <object>" locks up.  If "object"
is a directory then an "ls" will freeze.  If it's a file then you have
to reference the specific file to see the behavior.  Over a fairly short
period of time once this starts you're in *big* trouble; you'll end up with
thousands of processes hung in a disk wait for a specific file, and
eventually run out of either process slots or page space (most people retry
failed accesses, which makes the problem worse).  Per-user process limits
(which I have turned off on these machines) would stop some of the damage,
but not all.
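
One way to watch the pile-up described above is to count processes stuck in
disk wait at intervals. The following is a minimal sketch, not something taken
from the systems in question: it assumes a `ps` that accepts
`-axo pid,stat,command` and reports disk wait as a state beginning with "D".
The function names are hypothetical.

```python
import subprocess


def parse_disk_wait(ps_output):
    """Return (pid, command) pairs for processes whose state starts with 'D'.

    Expects text with a header line followed by 'PID STAT COMMAND' rows.
    """
    hung = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 2)  # pid, state, rest of line as command
        if len(fields) < 3:
            continue
        pid, stat, command = fields
        if stat.startswith("D"):
            hung.append((int(pid), command.strip()))
    return hung


def snapshot():
    """Run ps once and return the processes currently in disk wait.

    Assumption: this ps invocation is available on the host being checked.
    """
    out = subprocess.run(
        ["ps", "-axo", "pid,stat,command"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_disk_wait(out)
```

Calling `snapshot()` every minute or so and logging the length of the result
would show whether the count of hung processes keeps growing, and the command
column shows which object they are all waiting on.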
I was trying to resolve cache inconsistency problems with NFS when I ran
headfirst into this.  There is a problem with V3 mounts (the default now)
where you can "mv" a file on one client, and another client never sees the
change.  This is particularly distressing when you "mv" the access_log
file from a web server (from another machine), kick the server to re-create
the access_log file, and then find that it never shows up on the other
system (or does with zero length, but no data in it -- ever).  
If you look on the other system, an "ls" doesn't show the errant file.
But a "cat" does -- the data is still there.  Needless to say this is
pretty troublesome, and leads to lots of head-scratching.
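
The "ls" versus "cat" disagreement can be checked mechanically on a client.
This is a rough sketch under the assumption that comparing a directory listing
with a direct open of the same name is enough to expose the stale view; the
function names and result strings are hypothetical, and nothing here is
specific to NFS.

```python
import os


def visible_in_listing(path):
    """True if the file's name appears when its parent directory is listed."""
    parent, name = os.path.split(os.path.abspath(path))
    return name in os.listdir(parent)


def readable_directly(path):
    """True if the file can be opened and read by name."""
    try:
        with open(path, "rb") as f:
            f.read(1)
        return True
    except OSError:
        return False


def check(path):
    """Compare the two views of the same file and name any disagreement."""
    listed = visible_in_listing(path)
    readable = readable_directly(path)
    if listed == readable:
        return "consistent"
    return "readable-but-unlisted" if readable else "listed-but-unreadable"
```

Run against the old log path on the second client, a result of
"readable-but-unlisted" would match the symptom described here: the data is
reachable by name but the directory listing no longer shows it.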
I thought that perhaps the V3 code was bolluxed, so I rebuilt the kernel
to the most current revision and then tried to run with V2 mounts ("-2"
switch in the /etc/fstab file).  That's when I ran into the trouble;
removing the "-2" switch had no effect -- dashing my first hypothesis
that the V2 code had been broken for some time and I hadn't noticed it.
I'll see what else I can find out.
-- 
Karl Denninger (karl@MCS.Net)| MCSNet - Serving Chicagoland and Wisconsin
http://www.mcs.net/~karl     | T1's from $600 monthly to FULL DS-3 Service
			     | NEW! K56Flex modem support is now available
Voice: [+1 312 803-MCS1 x219]| 56kbps DIGITAL ISDN DOV on analog lines!
Fax:   [+1 312 803-4929]     | 2 FULL DS-3 Internet links; 400Mbps B/W Internal
