Date: Wed, 27 Nov 1996 17:34:22 -0600 (CST)
From: Karl Denninger <karl@Mcs.Net>
To: current@freebsd.org
Subject: Odd problem with NFS getpages()?
Message-ID: <199611272334.RAA19564@Jupiter.Mcs.Net>

Hi folks,

Following up on a conversation that I had at the FreeBSD, uh, I mean
Walnut Creek booth at Comdex :-)

We've identified a fairly significant problem with -current. It goes
something like this:

1) Mount a directory containing executables over NFS.

2) Start one of said executables (say, NCSA httpd 1.5.2).

3) Drive the system to do dynamic paging (ie: consume more than the
   physical RAM so RSS < required code size).

4) Cause an error on the NFS server (ie: pull the plug/reboot, detach the
   network cable for a few seconds, etc). Wait until you actually GET an
   error (ie: "Nfs server not responding") on the client.

5) REATTACH the network cable or restart the NFS server.

6) Watch the process puke. It does NOT die -- but you get infinite
   numbers of "getpages" failures on the console, which are retried every
   few seconds (these are bold messages, so they look to be coming from
   kernel printfs). The message says "probably hardware" (well, yes, it
   would be if this was a physical DISK). This is the same error you get
   if you pull the power cable on a drive while you have active binaries
   coming off it (or get a sector error on a drive -- we had THAT happen
   to us last night, and got the same message).

If you KILL the process (ie: kill -9 xxxxx) it WILL die. You're not
blocked from doing that. However, the process itself never takes a signal,
so it won't exit on its own.

Now, try this with a few hundred copies of that process running (ie: a
virtual server web machine with lots of httpds running) and you're really
screwed. If you're lucky, there's enough CPU left after doing all the
printfs and spinning around to actually get logged in and issue either the
kills or a reboot. If not, you get to hit reset.

It looks like the system is not actually retrying failed page gets in this
situation, and is considering the error "sticky".
Since it appears to never go back to the actual NFS disk (even though the
mount point has returned and is functional for subsequent invocations of
the code), you're dead.

Related to this is another problem where, out of the blue, NFS-mounted
executables will start dumping core with getpages errors on startup. I've
seen that with emacs and pico primarily (pico in particular loads a shared
library from NFS in our environment).

Ideas? This looks like something that should be fairly simple to find, as
it's easy to reproduce. It's present in all recent versions of -current up
to and including kernels built on 11-25.

--
--
Karl Denninger (karl@MCS.Net)| MCSNet - The Finest Internet Connectivity
http://www.mcs.net/~karl     | T1's from $600 monthly to FULL DS-3 Service
                             | 33 Analog Prefixes, 13 ISDN, Web servers $75/mo
Voice: [+1 312 803-MCS1 x219]| Email to "info@mcs.net" WWW: http://www.mcs.net/
Fax:   [+1 312 248-9865]     | 2 FULL DS-3 Internet links; 400Mbps B/W Internal
