From owner-freebsd-fs@FreeBSD.ORG Thu May 13 05:50:07 2010
Date: Thu, 13 May 2010 05:50:06 GMT
Message-Id: <201005130550.o4D5o6iu011831@freefall.freebsd.org>
To: freebsd-fs@FreeBSD.org
From: Andriy Gapon
Subject: Re: kern/145339: [zfs] deadlock after detaching block device from raidz pool
List-Id: Filesystems

The following reply was made to PR kern/145339; it has been noted by GNATS.

From: Andriy Gapon
To: Alex Bakhtin
Cc: bug-followup@freebsd.org, Pawel Jakub Dawidek
Subject: Re: kern/145339: [zfs] deadlock after detaching block device from raidz pool
Date: Thu, 13 May 2010 08:44:52 +0300

on 04/05/2010 02:23 Alex Bakhtin said the following:
>
> So, I can still easily reproduce this problem on 8-STABLE. Your
> simple patch helps to avoid page fault but dead-locks the system. Are
> you sure that you can just return at this point? Probably it make
> sense to set some error flag before return?

You are correct, my simple patch is far from being correct.
And properly fixing the problem is not trivial. Some issues:

1. vdev_geom_release() sets vdev_tsd to NULL before shutting down the
   corresponding gc_queue; because of that, bios that may later arrive via
   vdev_geom_io_intr() cannot be mapped to their gc_queue, so there is no
   choice but to drop them on the floor.

2. The shutdown logic in vdev_geom_worker() does not seem to be reliable.
   I think that the vdev thread may get stuck forever if a bio happens to
   be on gc_queue when vdev_geom_release() is called. In that case the
   gc_state check may be skipped and gc_queue may never be woken up again.

3. I am not sure whether pending zios are taken care of when
   vdev_geom_release() is called. If not, they may get stuck forever.

Hopefully Pawel can help us here.

> 2010/4/23 Andriy Gapon :
>> Can you try this patch?
>>
>> --- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
>> +++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
>> @@ -603,6 +603,9 @@ vdev_geom_io_intr(struct bio *bp)
>>         zio = bp->bio_caller1;
>>         ctx = zio->io_vd->vdev_tsd;
>>
>> +       if (ctx == NULL)
>> +               return;
>> +
>>         if ((zio->io_error = bp->bio_error) == 0 && bp->bio_resid != 0)
>>                 zio->io_error = EIO;
>>
>>
>> --
>> Andriy Gapon

--
Andriy Gapon