From owner-freebsd-stable@FreeBSD.ORG Sat Dec 22 17:08:15 2007
Date: Sun, 23 Dec 2007 04:08:09 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Kostik Belousov
Cc: freebsd-net@FreeBSD.org, freebsd-stable@FreeBSD.org
In-Reply-To: <20071222050743.GP57756@deviant.kiev.zoral.com.ua>
Message-ID: <20071223032944.G48303@delplex.bde.org>
References: <20071221234347.GS25053@tnn.dglawrence.com>
    <20071222050743.GP57756@deviant.kiev.zoral.com.ua>
Subject: Re: Packet loss every 30.999 seconds

On Sat, 22 Dec 2007, Kostik Belousov wrote:

> On Fri, Dec 21, 2007 at 05:43:09PM -0800, David Schwartz wrote:
>>
>> I'm just an observer, and I may be confused, but it seems to me that
>> this is motion in the wrong direction (at least, it's not going to
>> fix the actual problem).
>> As I understand the problem, once you reach a certain point, the
>> system slows down *every* 30.999 seconds. Now, it's possible for the
>> code to cause one slowdown as it cleans up, but why does it need to
>> clean up so much 31 seconds later?

It is just searching for things to clean up, and doing this pessimally
due to unnecessary cache misses and (more recently) the introduction of
overheads for handling the case where the mount point is locked into
the fast path where the mount point is not locked.  The search every 30
seconds or so is probably more efficient, and is certainly simpler,
than managing the list on every change to every vnode for every file
system.  However, it gives a high latency in non-preemptible kernels.

>> Why not find/fix the actual bug? Then work on getting the yield
>> right if it turns out there's an actual problem for it to fix.

Yielding is probably the correct fix for non-preemptible kernels.  Some
operations just take a long time, but are low priority, so they can be
preempted.  This operation is partly under user control, since any user
can call sync(2) and thus generate the latency every 31 seconds.  But
this is no worse than a user generating even larger blocks of latency
by reading huge amounts from /dev/zero.  My old latency workaround for
the latter (and other huge i/o's) is still sort of necessary, though it
now works bogusly (hogticks doesn't work since it is reset on context
switches to interrupt handlers; however, any context switch mostly
fixes the problem).  My old latency workaround only reduces the latency
to a multiple of 1/HZ, with a default of 200 ms, so it is still
supposed to allow latencies much larger than the ones that cause
problems here, but its bogus current operation tends to give latencies
of more like 1/HZ, which is short enough when HZ has its default
misconfiguration of 1000.
I still don't understand the original problem, that the kernel is not
even preemptible enough for network interrupts to work (except in 5.2
where Giant breaks things).  Perhaps I misread the problem, and it is
actually that networking works but userland is unable to run in time to
avoid packet loss.

>> If the problem is that too much work is being done at a stretch and
>> it turns out this is because work is being done erroneously or
>> needlessly, fixing that should solve the whole problem. Doing the
>> work that doesn't need to be done more slowly is at best an ugly
>> workaround.

Lots of necessary work is being done.

> Yes, rewriting the syncer is the right solution. It probably cannot
> be done quickly enough. If the yield workaround provide mitigation
> for now, it shall go in.

I don't think rewriting the syncer just for this is the right solution.
Rewriting the syncer so that it schedules actual i/o more efficiently
might involve a solution.  Better scheduling would probably take more
CPU and increase the problem.  Note that MNT_VNODE_FOREACH() is used 17
times, so the yielding fix is needed in 17 places if it isn't done
internally in MNT_VNODE_FOREACH().
There are 4 places in vfs and 13 places in 6 file systems:

% ./ufs/ffs/ffs_snapshot.c:	MNT_VNODE_FOREACH(xvp, mp, mvp) {
% ./ufs/ffs/ffs_snapshot.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./ufs/ffs/ffs_vfsops.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./ufs/ffs/ffs_vfsops.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./ufs/ufs/ufs_quota.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./ufs/ufs/ufs_quota.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./ufs/ufs/ufs_quota.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./fs/msdosfs/msdosfs_vfsops.c:	MNT_VNODE_FOREACH(vp, mp, nvp) {
% ./fs/coda/coda_subr.c:	MNT_VNODE_FOREACH(vp, mp, nvp) {
% ./gnu/fs/ext2fs/ext2_vfsops.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./gnu/fs/ext2fs/ext2_vfsops.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./kern/vfs_default.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./kern/vfs_subr.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./kern/vfs_subr.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./nfs4client/nfs4_vfsops.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./nfsclient/nfs_subs.c:	MNT_VNODE_FOREACH(vp, mp, nvp) {
% ./nfsclient/nfs_vfsops.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {

Only file systems that support writing need it (for VOP_SYNC() and for
MNT_RELOAD), else there would be many more places.  There would also be
more places if MNT_RELOAD support were not missing for some file
systems.

Bruce