Date: Sat, 27 Sep 1997 15:07:55 -0500
From: Karl Denninger <karl@mcs.net>
To: Poul-Henning Kamp <phk@critter.freebsd.dk>
Cc: Karl Denninger <karl@mcs.net>, Nate Williams <nate@mt.sri.com>, current@freebsd.org
Subject: Re: WARNING! Builds from the last few days have BROKEN NFS
Message-ID: <19970927150755.39452@Mars.Mcs.Net>
In-Reply-To: <13496.875390123@critter.freebsd.dk>; from Poul-Henning Kamp on Sat, Sep 27, 1997 at 09:55:23PM +0200
References: <19970927145131.64000@Mars.Mcs.Net> <13496.875390123@critter.freebsd.dk>
On Sat, Sep 27, 1997 at 09:55:23PM +0200, Poul-Henning Kamp wrote:
> >>
> >> 'Just do it, and quit 'yer arguing' :) :) :) :)
> >
> >What I don't understand is why this kind of change would have the effect
> >that it is having. That is, why would reducing the target number of vnodes
> >on the freelist lead to hangs in a disk wait ("D") which are unkillable?
>
> Karl, neither do any of us at this time.
>
> We may be pretty good wizards, but we're not gods any of us.
>
> We need more data, and that involves you doing as we say in an attempt
> to figure out what the heck is going on.
>
> I'm running a make buildworld across NFS right now, but so far I see no
> trouble along the lines of what you found.
>
> --
> Poul-Henning Kamp FreeBSD coreteam member
> phk@FreeBSD.ORG "Real hackers run -current on their laptop."
Ok -- tomorrow evening I will check out another copy of the current sources,
and give this another shot with the parameters you're referring to.
The problem shows up only during fairly heavy I/O load -- my initial tests
didn't show it, but putting the code on a reasonably-busy web server does,
and it's easily reproduced in about 20-30 minutes.
Same with the shell systems here. The symptom is that a single disk I/O
request will hang in a "D" state. Further attempts to access that same
object then also hang, but others, even to the same disk pack, do NOT.
That is, a "df" still works, but a "cat <object>" locks up. If "object"
is a directory then a "ls" will freeze. If it's a file then you have
to reference the specific file to see the behavior. Over a fairly short
period of time once this starts you're in *big* trouble; you'll end up with
thousands of processes hung in a disk wait for a specific file, and
eventually run out of either process slots or page space (most people retry
failed accesses, which makes the problem worse). Per-user process limits
(which I have turned off on these machines) would stop some of the damage,
but not all.
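
One way to watch the pile-up described above is to count processes in disk wait, grouped by wait channel. The following is only a minimal sketch: the function names are made up for illustration, the "ps" keywords (pid, state, wchan, command) are an assumption that varies between systems, and the parser assumes every column is non-empty.

```python
import subprocess
from collections import Counter


def count_disk_wait(ps_output):
    """Count processes whose state starts with "D", keyed by wait channel.

    Expects text with a header line followed by rows of the form
    "PID STATE WCHAN COMMAND" (assumed layout; adjust for your ps).
    """
    counts = Counter()
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 3)
        if len(fields) < 3:
            continue  # blank or truncated row
        state, wchan = fields[1], fields[2]
        if state.startswith("D"):
            counts[wchan] += 1
    return counts


def read_ps():
    """Run ps and return its output (the keyword list is an assumption)."""
    result = subprocess.run(
        ["ps", "-axo", "pid,state,wchan,command"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Running count_disk_wait(read_ps()) every minute or so and watching one wait channel climb would be a rough indication that accesses are stacking up behind a single object.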
I was trying to resolve cache inconsistency problems with NFS when I ran
headfirst into this. There is a problem with V3 mounts (the default now)
where you can "mv" a file on one client, and another client never sees the
change. This is particularly distressing when you "mv" the access_log
file from a web server (from another machine), kick the server to re-create
the access_log file, and then find that it never shows up on the other
system (or it does, with zero length and no data in it -- ever).
If you look on the other system, a "ls" doesn't show the errant file.
But a "cat" does -- the data is still there. Needless to say this is
pretty troublesome, and leads to lots of head-scratching.
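
The "ls" versus "cat" discrepancy above can be checked mechanically on a client. This is a rough sketch only; the function name is hypothetical, and it tests nothing more than "the name is missing from the directory listing but a direct lookup still succeeds".

```python
import os


def listing_mismatch(directory, name):
    """Return True if `name` can be looked up directly in `directory`
    but does not appear in the directory listing."""
    listed = name in os.listdir(directory)
    try:
        os.stat(os.path.join(directory, name))
        reachable = True
    except OSError:
        reachable = False
    return reachable and not listed
```

On a consistent filesystem this should always return False; a True result for a file on the NFS mount would match the symptom described here.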
I thought that perhaps the V3 code was bolluxed, so I rebuilt the kernel
to the most current revision and then tried to run with V2 mounts ("-2"
switch in the /etc/fstab file). That's when I ran into the trouble;
removing the "-2" switch had no effect -- dashing my first hypothesis
that the V2 code had been broken for some time and I hadn't noticed it.
I'll see what else I can find out.
--
Karl Denninger (karl@MCS.Net)| MCSNet - Serving Chicagoland and Wisconsin
http://www.mcs.net/~karl | T1's from $600 monthly to FULL DS-3 Service
| NEW! K56Flex modem support is now available
Voice: [+1 312 803-MCS1 x219]| 56kbps DIGITAL ISDN DOV on analog lines!
Fax: [+1 312 803-4929] | 2 FULL DS-3 Internet links; 400Mbps B/W Internal
