From owner-freebsd-hackers Wed Oct 15 05:06:16 1997
Return-Path:
Received: (from root@localhost)
        by hub.freebsd.org (8.8.7/8.8.7) id FAA04472
        for hackers-outgoing; Wed, 15 Oct 1997 05:06:16 -0700 (PDT)
        (envelope-from owner-freebsd-hackers)
Received: from word.smith.net.au (ppp20.portal.net.au [202.12.71.120])
        by hub.freebsd.org (8.8.7/8.8.7) with ESMTP id FAA04244
        for ; Wed, 15 Oct 1997 05:05:06 -0700 (PDT)
        (envelope-from mike@word.smith.net.au)
Received: from word.smith.net.au (localhost [127.0.0.1])
        by word.smith.net.au (8.8.7/8.8.5) with ESMTP id VAA00264
        for ; Wed, 15 Oct 1997 21:32:07 +0930 (CST)
Message-Id: <199710151202.VAA00264@word.smith.net.au>
X-Mailer: exmh version 2.0zeta 7/24/97
To: hackers@freebsd.org
Subject: Odd out-of-swap condition; ideas?
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Wed, 15 Oct 1997 21:32:03 +0930
From: Mike Smith
Sender: owner-freebsd-hackers@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

** Firstly, please note that this is on a 2.2 of around February vintage;
** if this is known-and-fixed, say no more than that and we will proceed
** to negotiating an upgrade.

We have a system in the field that is showing an odd out-of-swap
condition.  What's most odd is that it appears to involve a leak of
some sort, where swap remains attached to a process even though the
process doesn't appear to require it.

Some background for the following:

 - The 'idl' processes are running under the Linux ABI emulation.
   These suckers do *lots* of filesystem work; the 'temp' allocation
   class gets at least twice as much work as any other in the system.
 - The 'exptd' process hits the hardware directly (it has IOPL set).
 - Both of the above are started using 'su' out of system startup
   scripts, so they inherit either the daemon or default resource
   limits, in this case they should be limited to 64M max size.
 - All of the 'ps' output is from 'ps alxmwww', trimmed to keep the
   most interesting fields and processes.

Here is some relevant data shortly after startup:

Tue Oct 14 06:32:01 GMT 1997
Device      1K-blocks     Used    Avail Capacity  Type
/dev/sd0s1b    131072     5228   125780     4%    Interleaved

PRI NI   VSZ  RSS WCHAN  STAT TT        TIME COMMAND
 -6  0  4884  916 biowai D    con-  26:10.80 .../bin.linux/idl analysis_init
 69  0  4476 2196 -      R    ??     0:00.72 .../bin.linux/idl display_init
  2  0  3684 3476 select Ss   ??     0:47.29 /usr/X11R6/bin/X -auth ...
 18  0  2592  996 pause  S    ??     4:50.06 exptd: experiment ...

All looks pretty happy.  After a little while the display gets some
more work done, and grows a bit:

Tue Oct 14 19:38:46 GMT 1997
Device      1K-blocks     Used    Avail Capacity  Type
/dev/sd0s1b    131072    64096    66912    49%    Interleaved

PRI NI   VSZ  RSS WCHAN  STAT TT        TIME COMMAND
 -6  0 30560 4152 biowai D    ??    54:18.29 .../bin.linux/idl display_init
 10  0  4884 1212 wait   S    con- 205:33.58 .../bin.linux/idl analysis_init
  2  0  3708 2260 select Ss   ??     3:21.22 /usr/X11R6/bin/X -auth ...
 -6  0  2592  760 biowai D    ??   153:04.58 exptd: experiment: ...

Ok, that's not unreasonable, but note the amount of swap in use; it's
starting to look a bit suspicious.  There's nothing like that much in
toto in the VSZ column.  A little bit later we see:

Tue Oct 14 21:13:42 GMT 1997
Device      1K-blocks     Used    Avail Capacity  Type
/dev/sd0s1b    131072   128116     2892    98%    Interleaved

PRI NI   VSZ  RSS WCHAN  STAT TT        TIME COMMAND
  2  0 39220 4132 select S    ??    61:35.05 .../bin.linux/idl display_init
 18  0  4908 1020 pause  S    con- 226:26.83 .../bin.linux/idl analysis_init
  2  0  3708 2444 select Ss   ??     3:35.28 /usr/X11R6/bin/X -auth ...
 74  0  2592 1916 -      R    ??   170:51.50 exptd: experiment: ...

Whoa, where'd it all go?  Next pass (10 seconds later) ps died because
it couldn't allocate any memory.  At this point, various things were
failing (normally lots of fork/exec activity), but it struggled along.
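
As a rough cross-check of the "nothing like that much in the VSZ column"
observation, the figures from the three snapshots above can be totted up.
This is only a sketch in Python, not something run on the system in
question: the snapshot labels and the helper name unexplained_swap are
made up for illustration, and because the ps output was trimmed, the VSZ
sums cover only the processes listed.  VSZ also counts resident and
file-backed pages, so if anything it overstates what those processes
could be holding in swap.

```python
# Figures copied from the three snapshots above (all in 1K blocks).
# The ps listings were trimmed, so each VSZ list covers only the
# processes that were shown, not every process on the machine.
SNAPSHOTS = [
    ("Oct 14 06:32:01", 5228, [4884, 4476, 3684, 2592]),
    ("Oct 14 19:38:46", 64096, [30560, 4884, 3708, 2592]),
    ("Oct 14 21:13:42", 128116, [39220, 4908, 3708, 2592]),
]


def unexplained_swap(swap_used, vsz_list):
    """Swap in use beyond the summed VSZ of the listed processes.

    A negative result means the listed VSZ more than covers the swap in use.
    """
    return swap_used - sum(vsz_list)


for stamp, used, vsz in SNAPSHOTS:
    gap = unexplained_swap(used, vsz)
    print(f"{stamp}: swap used {used}K, listed VSZ {sum(vsz)}K, gap {gap}K")
```

On these numbers the listed VSZ comfortably covers the swap in use in
the first snapshot, but swap use exceeds it by 22352K in the second and
77688K in the third.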

The analysis died eventually, which let a single pass run:

Tue Oct 14 21:18:24 GMT 1997
Device      1K-blocks     Used    Avail Capacity  Type
/dev/sd0s1b    131072   130212      796    99%    Interleaved

PRI NI   VSZ  RSS WCHAN  STAT TT        TIME COMMAND
 -6  0 39236 5188 biowai D    ??    61:57.12 .../bin.linux/idl display_init
  2  0  3708 2388 select Ss   ??     3:36.05 /usr/X11R6/bin/X -auth ...

Note that just about everything else is gone, and still no swap left.
Then, eventually the display dies too, and immediately all is well
again:

Device      1K-blocks     Used    Avail Capacity  Type
/dev/sd0s1b    131072     9696   121312     7%    Interleaved

PRI NI   VSZ  RSS WCHAN  STAT TT        TIME COMMAND
 -6  0  4240  836 biowai D    ??     0:00.23 .../bin.linux/idl analysis_init
  2  0  3708 2436 select Ss   ??     3:38.57 /usr/X11R6/bin/X -auth ...
 -6  0   944  464 biowai D    ??     0:00.02 exptd: experiment: ...

(The analysis and experiment were resurrected by their startup scripts)

The conclusion reached from this is that the display process has
somehow managed to own a lot of swap that wasn't attached to it.

Any ideas?  Suggested explanations?  Upgrading this system will be a
little difficult (it is in remote eastern Germany), but will be
undertaken if a fix is likely.

Thanks,
mike