Date: Thu, 10 Apr 1997 21:16:41 +1000 (EST)
From: Stephen McKay
Message-Id: <199704101116.VAA04747@ogre.dtir.qld.gov.au>
To: freebsd-current@freebsd.org
cc: Stephen McKay
Subject: Re: Hang during NFS stress test

Stephen McKay wrote:

>Setup: 386DX20 with 8Mb ram running 2.2.1 (or very close) continually
>copying files from a 486DX33 running 2.1.7 back to the same mount point
>via TCP NFS. After two days (continuous copying) it has locked up. It
>still responds to pings, will switch virtual consoles, and I can get into
>ddb, but nothing else.
>
>Ddb shows that the machine is stuck in idle_loop(), and no processes are
>on the run queue (whichqs == 0), but ps (ddb command) shows a number of
>processes which are not waiting on anything. For example, there are 3
>getty's on the syscons virtual screens, and only one has non-zero wchan
>(probably because I hit enter a few times on some screens to see if I
>could wake them up).
>
>The only unusual wchan is swapper waiting on swinuw (which must be from
>pmap_swapin_proc). Other processes are in nfsidl, pause, wait, ttyin, etc.

Cpl and ipending look fine: just my console tty interrupt showing. The
clock is still updating 'time'.
There are no processes on any run queue because only one runnable process
is in core (P_INMEM), and that process is exiting (P_WEXIT). In fact, it
seems to have got all the way through exit1() and cpu_exit() into
cpu_switch(), which would have dropped us in idle() because everyone else
is asleep. Oh, and the parent is pseudo-awake: that is, it is not waiting,
but is not actually in core, so it must have been woken by the exiting
process near the end of exit1().

Process 0 (swapper) is waiting (on "swinuw", presumably in
pmap_swapin_proc) for some process's upages to unbusy. The processes not
waiting on anything are not runnable because they are swapped out. The
swapper hasn't managed to swap any of them in because it is stuck. Which
process has marked that upage busy? No idea.

So, what went wrong? Not a clue. This is Hard Stuff(tm) and I need some
help here. I can keep this hung machine hung for another day at least, but
can't guarantee any more. And it writes bad core dumps. Sigh.

Unfortunately the serious VM folks might be somewhat uninterested because
manipulations of upages have changed in -current (in a broad way I haven't
examined yet), and presumably any bugs would have moved or mutated.

Any pointers or DDB tips gratefully accepted.

Stephen.
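
For anyone following the reasoning about why whichqs is 0 even though
several processes have no wchan, here is a minimal sketch in C. It is only
a model, not kernel code: struct proc_model, classify(), count_on_runq(),
the process names in the table and the flag values are all made up for
illustration (the real P_INMEM and P_WEXIT definitions are in
<sys/proc.h>). It assumes that a process can sit on a run queue only if it
is runnable, in core, and not already past the point of exiting.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative values only; the real definitions live in <sys/proc.h>. */
#define P_INMEM 0x0004  /* process is loaded in core */
#define P_WEXIT 0x2000  /* process is working on exiting */

enum pstat { SSLEEP, SRUN };
enum why { ASLEEP, SWAPPED_OUT, EXITING, ON_RUNQ };

/* Hypothetical, trimmed-down stand-in for the kernel's struct proc. */
struct proc_model {
    const char *name;
    enum pstat  stat;
    int         flag;
    const char *wmesg;  /* NULL when not waiting on anything */
};

/* Say why a process is, or is not, eligible for a run queue. */
enum why classify(const struct proc_model *p)
{
    if (p->stat == SSLEEP)
        return ASLEEP;          /* blocked on a wchan */
    if (!(p->flag & P_INMEM))
        return SWAPPED_OUT;     /* awake, but the swapper must load it first */
    if (p->flag & P_WEXIT)
        return EXITING;         /* assumed already switched away for good */
    return ON_RUNQ;
}

/* Count the processes that would actually be on a run queue. */
size_t count_on_runq(const struct proc_model *table, size_t n)
{
    size_t i, count = 0;

    assert(table != NULL);
    for (i = 0; i < n; i++)
        if (classify(&table[i]) == ON_RUNQ)
            count++;
    return count;
}

/* A made-up process table shaped like the hung machine described above. */
const struct proc_model hung[] = {
    { "swapper", SSLEEP, P_INMEM,           "swinuw" },
    { "getty",   SRUN,   0,                 NULL     },
    { "parent",  SRUN,   0,                 NULL     },
    { "child",   SRUN,   P_INMEM | P_WEXIT, NULL     },
    { "nfsiod",  SSLEEP, P_INMEM,           "nfsidl" },
};
```

With this table, count_on_runq() finds nothing to run: the only process
that could swap the awake ones back in is itself asleep on "swinuw", which
is the deadlock described above.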