Date:      Wed, 27 Nov 1996 17:34:22 -0600 (CST)
From:      Karl Denninger  <karl@Mcs.Net>
To:        current@freebsd.org
Subject:   Odd problem with NFS getpages()?
Message-ID:  <199611272334.RAA19564@Jupiter.Mcs.Net>

Hi folks,

Following up on a conversation that I had at the FreeBSD, uh, I mean Walnut
Creek booth at Comdex :-)

We've identified a fairly significant problem with -current.  It goes
something like this:

1)	Mount a directory containing executables over NFS.
2)	Start one of said executables (say, NCSA httpd 1.5.2).
3)	Drive the system to do dynamic paging (i.e., consume more than the
	physical RAM so RSS < required code size).
4)	Cause an error on the NFS server (e.g., pull the plug/reboot,
	detach the network cable for a few seconds, etc.).  Wait until
	you actually GET an error (e.g., "Nfs server not responding") on the
	client.

5)	REATTACH the network cable or restart the NFS server.

6)	Watch the process puke.  It does NOT die -- but you get an endless
	stream of "getpages" failures on the console, retried every few
	seconds (these are bold messages, so they look to be coming from
	kernel printfs).  The message says "probably hardware" (well, yes,
	it would be if this were a physical DISK).  This is the same error
	you get if you pull the power cable on a drive while you have
	active binaries coming off it (or get a sector error on a drive --
	we had THAT happen to us last night, and got the same message).
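
For step 3, one way to force paging is a small helper that allocates and
touches more memory than the box has.  The following is only a rough
sketch, not something from our setup: the function name consume_memory
and its parameters are made up for illustration, and it assumes a Python
interpreter is available on the client.

```python
import mmap

# Page size of the running system, as reported by the mmap module.
PAGE_SIZE = mmap.PAGESIZE


def consume_memory(total_bytes, chunk_bytes=1 << 20):
    """Allocate total_bytes of memory in chunks and dirty every page.

    Hypothetical helper: returns the list of buffers so the caller can
    hold a reference and keep the memory in use.
    """
    chunks = []
    allocated = 0
    while allocated < total_bytes:
        size = min(chunk_bytes, total_bytes - allocated)
        buf = bytearray(size)
        # Write one byte per page so each page is really backed by
        # memory (or swap) rather than left untouched.
        for offset in range(0, size, PAGE_SIZE):
            buf[offset] = 1
        chunks.append(buf)
        allocated += size
    return chunks
```

Calling consume_memory() with a total larger than physical RAM, and
keeping the returned list alive while the NFS-mounted binary runs,
should push the binary's text pages out so they have to be fetched from
the server again.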

If you KILL the process (e.g., kill -9 xxxxx) it WILL die.  You're not blocked
from doing that.  However, the process itself never takes a signal, so it
won't exit on its own.

Now, try this with a few hundred copies of that process running (e.g., a
virtual server web machine with lots of httpds running) and you're really
screwed.  If you're lucky there's enough CPU left after doing all the
printfs and spinning around to actually get logged in and issue either 
the kills or a reboot.

If not, you get to hit reset.

It looks like the system is not actually retrying failed page gets in 
this situation, and is considering the error "sticky".  Since it appears 
to never go back to the actual NFS disk (even though the mount point has
returned and is functional for subsequent invocations of the code) you're
dead.
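
To make the "sticky" hypothesis concrete, here is a toy model of the
behavior I suspect versus what I'd expect.  It is NOT the kernel's pager
code, just an illustration; FlakyServer, StickyPager and RetryingPager
are invented names, and the real cause may well be something else.

```python
class FlakyServer:
    """Stand-in for the NFS server; 'up' can be toggled to fake an outage."""

    def __init__(self):
        self.up = True

    def read_page(self, n):
        if not self.up:
            raise IOError("server not responding")
        return "page-%d" % n


class StickyPager:
    """Suspected behavior: a page that failed once is never re-read."""

    def __init__(self, server):
        self.server = server
        self.failed = set()

    def getpage(self, n):
        if n in self.failed:
            # Error is remembered; the server is not asked again.
            raise IOError("getpages failed (sticky)")
        try:
            return self.server.read_page(n)
        except IOError:
            self.failed.add(n)
            raise


class RetryingPager:
    """Expected behavior: every fault goes back to the server."""

    def __init__(self, server):
        self.server = server

    def getpage(self, n):
        return self.server.read_page(n)
```

In this model, once the server comes back the retrying pager recovers,
while the sticky one keeps failing for the pages it tried during the
outage, yet pages it never touched still load fine.  That matches a
running process staying wedged while fresh invocations work.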

Related to this is another problem where out of the blue NFS mounted
executables will start dumping core with getpages errors on startup.  I've
seen that with emacs and pico primarily (pico in particular loads a shared
library from NFS in our environment).

Ideas?  This looks like something that should be fairly simple to find, as
it's easy to reproduce.  It's present in all recent versions of -current up
to and including kernels built on 11-25.

--
Karl Denninger (karl@MCS.Net)| MCSNet - The Finest Internet Connectivity
http://www.mcs.net/~karl     | T1's from $600 monthly to FULL DS-3 Service
			     | 33 Analog Prefixes, 13 ISDN, Web servers $75/mo
Voice: [+1 312 803-MCS1 x219]| Email to "info@mcs.net" WWW: http://www.mcs.net/
Fax:   [+1 312 248-9865]     | 2 FULL DS-3 Internet links; 400Mbps B/W Internal


