From owner-freebsd-current  Thu Apr 10 04:25:54 1997
Return-Path: <owner-current>
Received: (from root@localhost)
          by freefall.freebsd.org (8.8.5/8.8.5) id EAA29827
          for current-outgoing; Thu, 10 Apr 1997 04:25:54 -0700 (PDT)
Received: from bunyip.cc.uq.edu.au (daemon@bunyip.cc.uq.edu.au [130.102.2.1])
          by freefall.freebsd.org (8.8.5/8.8.5) with ESMTP id EAA29821
          for <freebsd-current@freebsd.org>; Thu, 10 Apr 1997 04:25:47 -0700 (PDT)
Received: (from daemon@localhost)
	by bunyip.cc.uq.edu.au (8.8.5/8.8.5) id VAA09335
	for freebsd-current@freebsd.org; Thu, 10 Apr 1997 21:25:43 +1000
Received: by ogre.dtir.qld.gov.au (8.7.5/DEVETIR-E0.3a)
	id VAA04747; Thu, 10 Apr 1997 21:16:41 +1000 (EST)
Date: Thu, 10 Apr 1997 21:16:41 +1000 (EST)
From: Stephen McKay <syssgm@dtir.qld.gov.au>
Message-Id: <199704101116.VAA04747@ogre.dtir.qld.gov.au>
To: freebsd-current@freebsd.org
cc: Stephen McKay <syssgm@dtir.qld.gov.au>
Subject: Re: Hang during NFS stress test
X-Newsreader: NN version 6.5.0 #1 (NOV)
Sender: owner-current@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

Stephen McKay <syssgm@dtir.qld.gov.au> wrote:

>Setup: 386DX20 with 8MB RAM running 2.2.1 (or very close) continually
>copying files from a 486DX33 running 2.1.7 back to the same mount point
>via TCP NFS.  After two days (continuous copying) it has locked up.  It
>still responds to pings, will switch virtual consoles, and I can get into
>ddb, but nothing else.
>
>Ddb shows that the machine is stuck in idle_loop(), and no processes are
>on the run queue (whichqs == 0), but ps (ddb command) shows a number of
>processes which are not waiting on anything.  For example, there are 3
>gettys on the syscons virtual screens, and only one has a non-zero wchan
>(probably because I hit enter a few times on some screens to see if I
>could wake them up).
>
>The only unusual wchan is swapper waiting on swinuw (which must be from
>pmap_swapin_proc).  Other processes are in nfsidl, pause, wait, ttyin, etc.

Cpl and ipending look fine: just my console tty interrupt showing.  The
clock is still updating 'time'.

There are no processes on any run queue because only one runnable process
is in core (P_INMEM), and that process is exiting (P_WEXIT).  In fact, it
seems to have got all the way through exit1() and cpu_exit() into
cpu_switch(), which would have dropped us in idle() because everyone else
is asleep.  Oh, and the parent is pseudo-awake: that is, it is not
waiting, but is not actually in core, so it must have been woken by the
exiting process near the end of exit1().
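
To make that state concrete, here is a toy C model of it.  It is only a
sketch, not kernel code: the toy_* names and the flag values are made up
for illustration (the real definitions are in <sys/proc.h>).  It assumes
that a process can only be put on a run queue when it is both runnable
and resident, which is what the ddb output suggests.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins only; not the kernel's real values. */
enum toy_stat { TOY_SSLEEP, TOY_SRUN };
#define TOY_P_INMEM 0x1  /* process is resident (not swapped out) */

struct toy_proc {
    enum toy_stat stat;  /* sleeping or runnable */
    int flag;            /* TOY_P_INMEM or 0 */
    const char *wmesg;   /* what it sleeps on; NULL if not sleeping */
};

/* Assumed rule: only a runnable, resident process may be queued. */
static int toy_runq_eligible(const struct toy_proc *p)
{
    return p->stat == TOY_SRUN && (p->flag & TOY_P_INMEM) != 0;
}

/*
 * Count the "pseudo-awake" processes: runnable, but swapped out, so
 * they stay off the run queues until the swapper brings them in.
 */
static size_t toy_count_stranded(const struct toy_proc *procs, size_t n)
{
    size_t i, stranded = 0;

    assert(procs != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (procs[i].stat == TOY_SRUN &&
            (procs[i].flag & TOY_P_INMEM) == 0)
            stranded++;
    }
    return stranded;
}
```

In this model the gettys with a zero wchan and the parent of the exiting
process would all count as stranded, while a process asleep on nfsidl
would not.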

Process 0 (swapper) is waiting (on "swinuw", presumably in pmap_swapin_proc)
for some process's upages to unbusy.  The processes not waiting on anything
are not runnable because they are swapped out.  The swapper hasn't managed
to swap any of them in because it is stuck.  Which process has marked that
upage busy?  No idea.

So, what went wrong?  Not a clue.  This is Hard Stuff(tm) and I need some
help here.  I can keep this hung machine hung for another day at least,
but can't guarantee any more.  And it writes bad core dumps.  Sigh.

Unfortunately the serious VM folks might be somewhat uninterested because
manipulations of upages have changed in -current (in a broad way I haven't
examined yet), and presumably any bugs would have moved or mutated.

Any pointers or DDB tips gratefully accepted.

Stephen.