Date: Sat, 27 Sep 1997 15:07:55 -0500
From: Karl Denninger <karl@mcs.net>
To: Poul-Henning Kamp <phk@critter.freebsd.dk>
Cc: Karl Denninger <karl@mcs.net>, Nate Williams <nate@mt.sri.com>, current@freebsd.org
Subject: Re: WARNING! Builds from the last few days have BROKEN NFS
Message-ID: <19970927150755.39452@Mars.Mcs.Net>
In-Reply-To: <13496.875390123@critter.freebsd.dk>; from Poul-Henning Kamp on Sat, Sep 27, 1997 at 09:55:23PM +0200
References: <19970927145131.64000@Mars.Mcs.Net> <13496.875390123@critter.freebsd.dk>
On Sat, Sep 27, 1997 at 09:55:23PM +0200, Poul-Henning Kamp wrote:
> >>
> >> 'Just do it, and quit 'yer arguing' :) :) :) :)
> >
> >What I don't understand is why this kind of change would have the effect
> >that it is having.  That is, why would reducing the target number of vnodes
> >on the freelist lead to hangs in a disk wait ("D") which are unkillable?
>
> Karl, neither do any of us at this time.
>
> We may be pretty good wizards, but we're not gods any of us.
>
> We need more data, and that involves you doing as we say in an attempt
> to figure out what the heck is going on.
>
> I'm running a make buildworld across NFS right now, but so far I see no
> trouble along the lines of what you found.
>
> --
> Poul-Henning Kamp             FreeBSD coreteam member
> phk@FreeBSD.ORG               "Real hackers run -current on their laptop."

Ok -- tomorrow evening I will check out another copy of the current sources,
and give this another shot with the parameters you're referring to.

The problem shows up only during fairly heavy I/O load -- my initial tests
didn't show it, but putting the code on a reasonably-busy web server does,
and it's easily reproduced in about 20-30 minutes.  Same with the shell
systems here.

The symptom is that a single disk I/O request will hang in a "D" state.
Further attempts to access that same object then also hang, but others, even
to the same disk pack, do NOT.  That is, a "df" still works, but a
"cat <object>" locks up.  If "object" is a directory then a "ls" will freeze.
If it's a file then you have to reference the specific file to see the
behavior.

Over a fairly short period of time once this starts you're in *big* trouble;
you'll end up with thousands of processes hung in a disk wait for a specific
file, and eventually run out of either process slots or page space (most
people retry failed accesses, which makes the problem worse).  Per-user
process limits (which I have turned off on these machines) would stop some
of the damage, but not all.
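For anyone trying to catch the pile-up described above while it is still small, here is a minimal sketch in Python. It assumes you can capture `ps` output with the PID, state, and command in the first three columns; the exact `ps` flags needed to get that layout are left to the reader. The function names, the sample listing, and the paths in it are all made up for illustration and are not taken from the affected machines.

```python
from collections import Counter


def disk_wait_processes(ps_output):
    """Return (pid, command) pairs for processes whose state starts with 'D'.

    Assumes a header line followed by rows of: PID, state, command.
    """
    hung = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue  # ignore blank or truncated rows
        pid, state, command = fields
        if state.startswith("D"):
            hung.append((int(pid), command))
    return hung


def pileup_by_command(ps_output):
    """Count disk-wait processes per command line to spot a stuck object."""
    return Counter(command for _pid, command in disk_wait_processes(ps_output))


# Hypothetical listing: two readers stuck on one file, one "ls" stuck on
# its directory, while "df" and cron are unaffected.
SAMPLE = """\
  PID STAT COMMAND
  101 Ss   /usr/sbin/cron
  214 D    cat /home/www/logs/access_log
  215 D    cat /home/www/logs/access_log
  230 D+   ls /home/www/logs
  300 R    df
"""

if __name__ == "__main__":
    for command, count in pileup_by_command(SAMPLE).most_common():
        print(count, command)
```

Running something like this periodically and watching for one command line whose count keeps climbing would point at the object that first hung, before process slots run out.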
I was trying to resolve cache inconsistency problems with NFS when I ran
headfirst into this.

There is a problem with V3 mounts (the default now) where you can "mv" a
file on one client, and another client never sees the change.  This is
particularly distressing when you "mv" the access_log file from a web server
(from another machine), kick the server to re-create the access_log file,
and then find that it never shows up on the other system (or does with zero
length, but no data in it -- ever).

If you look on the other system, a "ls" doesn't show the errant file.  But
a "cat" does -- the data is still there.

Needless to say, this is pretty troublesome, and leads to lots of
head-scratching.  I thought that perhaps the V3 code was bollixed, so I
rebuilt the kernel to the most current revision and then tried to run with
V2 mounts ("-2" switch in the /etc/fstab file).

That's when I ran into the trouble; removing the "-2" switch had no
effect -- dashing my first hypothesis that the V2 code had been broken for
some time and I hadn't noticed it.

I'll see what else I can find out.

--
--
Karl Denninger (karl@MCS.Net)| MCSNet - Serving Chicagoland and Wisconsin
http://www.mcs.net/~karl     | T1's from $600 monthly to FULL DS-3 Service
                             | NEW! K56Flex modem support is now available
Voice: [+1 312 803-MCS1 x219]| 56kbps DIGITAL ISDN DOV on analog lines!
Fax:   [+1 312 803-4929]     | 2 FULL DS-3 Internet links; 400Mbps B/W Internal