Date:      Sat, 20 Feb 1999 12:17:10 +0000 (GMT)
From:      Doug Rabson <dfr@nlsystems.com>
To:        Matthew Dillon <dillon@apollo.backplane.com>
Cc:        freebsd-hackers@FreeBSD.ORG
Subject:   Re: Panic in FFS/4.0 as of yesterday
Message-ID:  <Pine.BSF.4.05.9902201158300.82049-100000@herring.nlsystems.com>
In-Reply-To: <199902190915.BAA31066@apollo.backplane.com>

On Fri, 19 Feb 1999, Matthew Dillon wrote:

> :On Thu, 18 Feb 1999, Matthew Jacob wrote:
> : 
> :> Oh, btw- I should clarify a little about this test and some spice to the
> :> mix... The same test on a system that is 25% of the cpu power and 25% of
> :> the memory running Solaris 2.7 Intel not only has always run this test
> :> successfully but also retains quite acceptable responsiveness. Please don't
> :> make me claim Slowlaris is better!
> :
> :I'm sure that something very wrong is happening, don't worry. Hopefully, I
> :will be able to see something.
> :
> :--
> :Doug Rabson				Mail:  dfr@nlsystems.com
> :Nonlinear Systems Ltd.			Phone: +44 181 442 9037
> 
>     I've started testing the VN device.  So far I've found it to be
>     extremely unstable when using an NFSV2 or NFSV3 file as backing
>     store.  I'm going to try using an MFS based file as backing store
>     next to see whether the problem is with the VN device or the NFS device.
> 
>     I've gotten the bmsafemap softupdates panic with softupdates mounted
>     filesystems sitting on top of VN, but that was with the NFS-backed VN
>     test which was unstable even without softupdates so I don't know if
>     that is a real crash.
> 
>     I haven't tried reproducing the softupdates panic on its own merits
>     yet.  I want to fix VN first.

I've just been looking at the responsiveness problem associated with Matt
Jacob's bulk writing test and I can see what is happening (although I'm
not sure what to do about it).

The system is unresponsive because the root inode is locked virtually all
of the time. This is caused by a lock cascade leading back to a single
process which is trying to rewrite a block of the directory the test is
running in (synchronously, since the fs is not using softupdates). That
process is waiting for its i/o to complete before unlocking the directory.
Unfortunately the buffer is the last on the drive's buffer queue and there
are 647 (for one instance which I examined in the debugger) buffers ahead
of it, most of which are writing about 8k. About 4Mb of buffers on the
queue are from a *single* process which seems extreme.

The i/o for directories is being hugely delayed by the several bulk
writing threads which the test has managed to start up and any directory
which stays locked for long can easily lead to a locked root vnode
(especially since there is a herd of processes in the test trying to
create files in the same directory).

I have modified my source tree to use bufq_insert_tail instead of
bufqdisksort in scsi_da.c, which didn't make any difference to the
responsiveness problem (it probably made it worse, since it guarantees
that the directory i/o is delayed by the maximum amount of time).

It seems to me that there should be a mechanism to prevent the queued i/o
lists from becoming so long (over 5Mb is queued on the machine which I
have in the debugger), perhaps by throttling the writers if they start too
much asynchronous i/o.  I wonder if this can be treated as a similar
problem to the swapper latency issues which John Dyson was talking about.

I haven't seen the panic which Matt reported yet, but I imagine that it's an
overload condition caused by the extreme amounts of pending i/o.

--
Doug Rabson				Mail:  dfr@nlsystems.com
Nonlinear Systems Ltd.			Phone: +44 181 442 9037



