From owner-freebsd-stable Mon Apr 2 2:15:12 2001
Delivered-To: freebsd-stable@freebsd.org
Received: from fw.wintelcom.net (ns1.wintelcom.net [209.1.153.20])
	by hub.freebsd.org (Postfix) with ESMTP id 9D53F37B71E
	for ; Mon, 2 Apr 2001 02:15:08 -0700 (PDT)
	(envelope-from bright@fw.wintelcom.net)
Received: (from bright@localhost)
	by fw.wintelcom.net (8.10.0/8.10.0) id f329En418078;
	Mon, 2 Apr 2001 02:14:49 -0700 (PDT)
Date: Mon, 2 Apr 2001 02:14:49 -0700
From: Alfred Perlstein
To: Greg Lehey
Cc: Andrew Gordon , freebsd-stable@FreeBSD.ORG
Subject: Re: 4.3-RC processes stuck sleeping on "inode" (?vinum) problem update
Message-ID: <20010402021449.M813@fw.wintelcom.net>
References: <20010402094208.D73090@wantadilla.lemis.com> <20010402182909.A75576@wantadilla.lemis.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5i
In-Reply-To: <20010402182909.A75576@wantadilla.lemis.com>; from grog@lemis.com on Mon, Apr 02, 2001 at 06:29:09PM +0930
X-all-your-base: are belong to us.
Sender: owner-freebsd-stable@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

* Greg Lehey [010402 01:59] wrote:
> On Monday, 2 April 2001 at 9:42:08 +0930, Greg Lehey wrote:
> > On Monday, 2 April 2001 at 0:36:31 +0100, Andrew Gordon wrote:
> >>
> >> Further to my previous report:
> >>
> >> - This is definitely a problem in 4.3RC: I rolled back to 31st Jan
> >>   sources (world & kernel), and the system has now been up for 36 hours
> >>   (as opposed to at most 6 hours running 4.3RC).
> >>
> >> - New evidence makes me lean towards thinking that Vinum is responsible
> >>   (though this is by no means conclusive):
> >>
> >>   1. I had previously only had my nfsd processes getting stuck
> >>      (plus the 'reboot' process itself if I tried to reboot),
> >>      however, while doing a 'cvs checkout' onto the vinum filesystem
> >>      to build my jan31 world, the cvs process got stuck in "inode" too.
> >>
> >>   2.
That same cvs checkout completed OK on a non-vinum filesystem.
> >>
> >>   3. I have just noticed in my console logs, that in the "ps"
> >>      output showing the nfsd processes stuck in "inode",
> >>      the "(syncer)" process is stuck in "vrlock" which is a
> >>      vinum wait channel.
> >
> > Hmm.  This is pretty conclusive.  It's a deadlock.
> >
> > Tor Egge reported a possible cause of this kind of deadlock.  I've
> > been testing a fix, but I'm not sure it doesn't have side effects.
> > Try this (in /usr/src/sys/dev/vinum), then rebuild the kernel module
> > (in /usr/src/sys/modules/vinum), stop and restart vinum, and see if it
> > helps:
> >
> > RCS file: /home/ncvs/src/sys/dev/vinum/vinumlock.c,v
> > retrieving revision 1.18.2.2
> > diff -w -u -r1.18.2.2 vinumlock.c
> > --- vinumlock.c	2001/03/13 02:59:43	1.18.2.2
> > +++ vinumlock.c	2001/04/02 00:09:53
> > @@ -169,7 +169,7 @@
> >  #endif
> >      plex->lockwaits++;    /* waited one more time */
> >      tsleep(lock, PRIBIO, "vrlock", 0);
> > -    lock = plex->lock;    /* start again */
> > +    lock = &plex->lock[-1];    /* start again */
> >      foundlocks = 0;
> >      pos = NULL;
> >  }
>
> OK.  I've tried this change, and indeed I still ended up with
> problems.  It seems that from time to time a wakeup gets lost, causing
> things to hang.  I've now made a workaround, and things seem to be
> working stably.  Try this fix instead (or apply the other line if
> you've already made a change).  I'm relatively confident that this
> will fix the problem.  In view of the code freeze, please let me know
> as soon as possible whether this fixes your problem.
>
> RCS file: /home/ncvs/src/sys/dev/vinum/vinumlock.c,v
> retrieving revision 1.18.2.2
> diff -w -u -r1.18.2.2 vinumlock.c
> --- vinumlock.c	2001/03/13 02:59:43	1.18.2.2
> +++ vinumlock.c	2001/04/02 08:56:26
> @@ -168,8 +168,8 @@
>  }
>  #endif
>      plex->lockwaits++;    /* waited one more time */
> -    tsleep(lock, PRIBIO, "vrlock", 0);
> -    lock = plex->lock;    /* start again */
> +    tsleep(lock, PRIBIO, "vrlock", hz);
> +    lock = &plex->lock [-1];    /* start again */
>      foundlocks = 0;
>      pos = NULL;
>  }

Err, if you're going to commit this, it needs a detailed XXX comment,
perhaps pointing to this thread until it's fixed.

As far as a fix, a couple of suggestions:

1) I think unlockrange might require an splbio to protect the
   lock-> data as well as the plex-> data.

2) I think you may want to be using wakeup, not wakeup_one(),
   although doing that may really be obscuring the problem
   rather than solving it.

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
Represent yourself, show up at BABUG http://www.babug.org/
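
For anyone following the thread, here is a rough userland sketch of why
changing the tsleep() timeout from 0 to hz hides a lost wakeup.  It is a
toy model only, not vinum or kernel code: toy_tsleep(), toy_wakeup(),
toy_tick(), struct sleeper and the HZ value are all made up for
illustration, and the ordering of the race (wakeup issued before the
waiter goes to sleep) is an assumption about how the wakeup gets lost.

```c
#include <stdbool.h>
#include <stddef.h>

#define HZ 100  /* assumed ticks per second; the kernel uses the variable hz */

/* Hypothetical toy model of one sleeping process; not a kernel structure. */
struct sleeper {
    const void *chan;   /* wait channel, NULL when runnable */
    int timo;           /* remaining ticks; 0 means sleep forever */
};

/* Toy stand-in for tsleep(chan, ..., timo): block on chan. */
static void toy_tsleep(struct sleeper *s, const void *chan, int timo)
{
    s->chan = chan;
    s->timo = timo;
}

/* Toy stand-in for wakeup(chan): only affects a process already asleep on chan. */
static void toy_wakeup(struct sleeper *s, const void *chan)
{
    if (s->chan == chan)
        s->chan = NULL;
}

/* Advance the clock one tick; a bounded sleep expires when it reaches zero. */
static void toy_tick(struct sleeper *s)
{
    if (s->chan != NULL && s->timo > 0 && --s->timo == 0)
        s->chan = NULL;
}

/*
 * Replay the assumed race: the wakeup arrives before the waiter sleeps,
 * so it is lost.  Returns the tick at which the waiter becomes runnable
 * again, or -1 if it is still asleep after max_ticks.
 */
static int ticks_until_runnable(int timo, int max_ticks)
{
    int lock_word = 0;                      /* stands in for the range lock */
    struct sleeper waiter = { NULL, 0 };

    toy_wakeup(&waiter, &lock_word);        /* lost: nobody is asleep yet */
    toy_tsleep(&waiter, &lock_word, timo);  /* waiter sleeps afterwards */

    for (int t = 1; t <= max_ticks; t++) {
        toy_tick(&waiter);
        if (waiter.chan == NULL)
            return t;                       /* caller would now "start again" */
    }
    return -1;
}
```

With timo = 0 the waiter never comes back, which matches the hang; with
timo = HZ it comes back after one second's worth of ticks and can rescan
the locks.  Note that the timeout only bounds the stall, it does not
explain where the wakeup went, which is why the XXX comment still seems
warranted.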