From owner-freebsd-stable  Mon Apr  2  2:15:12 2001
Delivered-To: freebsd-stable@freebsd.org
Received: from fw.wintelcom.net (ns1.wintelcom.net [209.1.153.20])
	by hub.freebsd.org (Postfix) with ESMTP id 9D53F37B71E
	for <freebsd-stable@FreeBSD.ORG>; Mon,  2 Apr 2001 02:15:08 -0700 (PDT)
	(envelope-from bright@fw.wintelcom.net)
Received: (from bright@localhost)
	by fw.wintelcom.net (8.10.0/8.10.0) id f329En418078;
	Mon, 2 Apr 2001 02:14:49 -0700 (PDT)
Date: Mon, 2 Apr 2001 02:14:49 -0700
From: Alfred Perlstein <bright@wintelcom.net>
To: Greg Lehey <grog@lemis.com>
Cc: Andrew Gordon <arg@arg1.demon.co.uk>, freebsd-stable@FreeBSD.ORG
Subject: Re: 4.3-RC processes stuck sleeping on "inode" (?vinum) problem update
Message-ID: <20010402021449.M813@fw.wintelcom.net>
References: <Pine.BSF.4.21.0104020008080.9790-100000@server.arg.sj.co.uk> <20010402094208.D73090@wantadilla.lemis.com> <20010402182909.A75576@wantadilla.lemis.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <20010402182909.A75576@wantadilla.lemis.com>; from grog@lemis.com on Mon, Apr 02, 2001 at 06:29:09PM +0930
X-all-your-base: are belong to us.
Sender: owner-freebsd-stable@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

* Greg Lehey <grog@lemis.com> [010402 01:59] wrote:
> On Monday,  2 April 2001 at  9:42:08 +0930, Greg Lehey wrote:
> > On Monday,  2 April 2001 at  0:36:31 +0100, Andrew Gordon wrote:
> >>
> >> Further to my previous report:
> >>
> >>  - This is definitely a problem in 4.3RC: I rolled back to 31st Jan
> >>    sources (world & kernel), and the system has now been up for 36 hours
> >>    (as opposed to at most 6 hours running 4.3RC).
> >>
> >>  - New evidence makes me lean towards thinking that Vinum is responsible
> >>    (though this is by no means conclusive):
> >>
> >>     1. I had previously only had my nfsd processes getting stuck
> >>        (plus the 'reboot' process itself if I tried to reboot),
> >>        however, while doing a 'cvs checkout' onto the vinum filesystem
> >>        to build my jan31 world, the cvs process got stuck in "inode" too.
> >>
> >>     2. That same cvs checkout completed OK on a non-vinum filesystem.
> >>
> >>     3. I have just noticed in my console logs, that in the "ps"
> >>        output showing the nfsd processes stuck in "inode",
> >>        the "(syncer)" process is stuck in "vrlock" which is a
> >>        vinum wait channel.
> >
> > Hmm.  This is pretty conclusive.  It's a deadlock.
> >
> > Tor Egge reported a possible cause of this kind of deadlock.  I've
> > been testing a fix, but I'm not sure it doesn't have side effects.
> > Try this (in /usr/src/sys/dev/vinum), then rebuild the kernel module
> > (in /usr/src/sys/modules/vinum), stop and restart vinum, and see if it
> > helps:
> >
> > RCS file: /home/ncvs/src/sys/dev/vinum/vinumlock.c,v
> > retrieving revision 1.18.2.2
> > diff -w -u -r1.18.2.2 vinumlock.c
> > --- vinumlock.c 2001/03/13 02:59:43     1.18.2.2
> > +++ vinumlock.c 2001/04/02 00:09:53
> > @@ -169,7 +169,7 @@
> >  #endif
> >                     plex->lockwaits++;                      /* waited one more time */
> >                     tsleep(lock, PRIBIO, "vrlock", 0);
> > -                   lock = plex->lock;                      /* start again */
> > +                   lock = &plex->lock[-1];                 /* start again */
> >                     foundlocks = 0;
> >                     pos = NULL;
> >                 }
> 
> OK.  I've tried this change, and indeed I still ended up with
> problems.  It seems that from time to time a wakeup gets lost, causing
> things to hang.  I've now made a workaround, and things seem to be
> working stably.  Try this fix instead (or apply the other line if
> you've already made a change).  I'm relatively confident that this
> will fix the problem.  In view of the code freeze, please let me know
> as soon as possible whether this fixes your problem.
> 
> RCS file: /home/ncvs/src/sys/dev/vinum/vinumlock.c,v
> retrieving revision 1.18.2.2
> diff -w -u -r1.18.2.2 vinumlock.c
> --- vinumlock.c 2001/03/13 02:59:43     1.18.2.2
> +++ vinumlock.c 2001/04/02 08:56:26
> @@ -168,8 +168,8 @@
>                     }
>  #endif
>                     plex->lockwaits++;                      /* waited one more time */
> -                   tsleep(lock, PRIBIO, "vrlock", 0);
> -                   lock = plex->lock;                      /* start again */
> +                   tsleep(lock, PRIBIO, "vrlock", hz);
> +                   lock = &plex->lock [-1];                /* start again */
>                     foundlocks = 0;
>                     pos = NULL;
>                 }

Err, if you're going to commit this, it needs a detailed XXX
comment saying that the hz timeout is a workaround for a lost
wakeup, perhaps pointing to this thread until it's properly fixed.

As for a real fix, a couple of suggestions:

1) I think unlockrange() might need an splbio() to protect the
   lock-> data as well as the plex-> data.
2) I think you may want to be using wakeup(), not wakeup_one(),
   although doing that may just obscure the problem rather than
   solve it.
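
To make (1) and (2) a bit more concrete, here is a rough userland
sketch using pthreads. It is only an analogy, not vinum code: the
mutex stands in for splbio() protecting the whole lock table,
pthread_cond_broadcast() stands in for wakeup(), and every name in it
(lock_range, unlock_range, run_demo, NLOCKS and so on) is made up for
illustration.

```c
#include <pthread.h>

#define NLOCKS   4      /* size of the lock table (hypothetical) */
#define NTHREADS 4      /* number of competing threads (hypothetical) */
#define NITER    1000   /* lock/unlock rounds per thread (hypothetical) */

struct rangelock {
    long stripe;        /* which stripe this slot locks */
    int  busy;          /* nonzero while the slot is held */
};

static struct rangelock locks[NLOCKS];
static pthread_mutex_t table_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  table_cv  = PTHREAD_COND_INITIALIZER;
static long counter;    /* only touched while holding the lock on stripe 0 */

/* Take the range lock for a stripe, sleeping while it conflicts. */
static void lock_range(long stripe)
{
    pthread_mutex_lock(&table_mtx);     /* analogue of raising to splbio */
    for (;;) {
        struct rangelock *free_slot = NULL;
        int conflict = 0;

        /* Always rescan the whole table from the start after a sleep. */
        for (int i = 0; i < NLOCKS; i++) {
            if (locks[i].busy && locks[i].stripe == stripe)
                conflict = 1;
            else if (!locks[i].busy && free_slot == NULL)
                free_slot = &locks[i];
        }
        if (!conflict && free_slot != NULL) {
            free_slot->stripe = stripe;
            free_slot->busy = 1;
            break;
        }
        /* Atomically drops table_mtx while asleep, so no wakeup is lost. */
        pthread_cond_wait(&table_cv, &table_mtx);
    }
    pthread_mutex_unlock(&table_mtx);
}

/* Release the range lock and wake every waiter, not just one. */
static void unlock_range(long stripe)
{
    pthread_mutex_lock(&table_mtx);     /* table entry changes under the lock */
    for (int i = 0; i < NLOCKS; i++) {
        if (locks[i].busy && locks[i].stripe == stripe)
            locks[i].busy = 0;
    }
    pthread_cond_broadcast(&table_cv);  /* analogue of wakeup() */
    pthread_mutex_unlock(&table_mtx);
}

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < NITER; i++) {
        lock_range(0);
        counter++;                      /* protected by the range lock */
        unlock_range(0);
    }
    return NULL;
}

/* Run the contention test and return the final counter value. */
long run_demo(void)
{
    pthread_t t[NTHREADS];

    counter = 0;
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return counter;
}
```

Call run_demo() from a main() and build with "cc -pthread". If no
update is lost and nobody hangs, it returns NTHREADS * NITER. The
point is that the table is only ever examined or changed with the
mutex held, and that a waiter which wakes up and still cannot proceed
simply goes back to sleep, so waking everybody is safe even if it is
wasteful.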


-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
Represent yourself, show up at BABUG http://www.babug.org/
