From owner-freebsd-fs@FreeBSD.ORG Mon May 3 23:30:04 2010
Date: Mon, 3 May 2010 23:30:03 GMT
Message-Id: <201005032330.o43NU3GM020629@freefall.freebsd.org>
To: freebsd-fs@FreeBSD.org
From: Alex Bakhtin
Subject: Re: kern/145339: [zfs] deadlock after detaching block device from raidz pool

The following reply was made to PR kern/145339; it has been noted by GNATS.

From: Alex Bakhtin
To: Andriy Gapon
Cc: bug-followup@freebsd.org, Pawel Jakub Dawidek
Subject: Re: kern/145339: [zfs] deadlock after detaching block device from raidz pool
Date: Tue, 4 May 2010 03:23:35 +0400

Andriy,

Upgraded to today's stable. Reproduced the problem.
On GENERIC the system just hangs with the following output:

==========================
ad12: FAILURE - WRITE_DMA48 status=7f error=0 LBA=2312588250

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address   = 0x48
fault code              = supervisor write data, page not present
instruction pointer     = 0x20:0xffffffff80593e95
stack pointer           = 0x28:0xffffff8000065ba0
frame pointer           = 0x28:0xffffff8000065bb0
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 3 (g_up)
trap number             = 12
panic: page fault
cpuid = 1

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x0
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80545a28
stack pointer           = 0x28:0xffffff80eada2a40
frame pointer           = 0x28:0xffffff80eada2a90
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 0 (spa_zio)
trap number             = 12
==========================

With GENERIC + DDB/KDB enabled I got the following, with the output of the two CPUs interleaved on the console (it seems that the first time I detached the device there was no active transaction; I can try to reproduce that):

==========================
ad12: FAILURE - device detached

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address   = 0x48
fault code              = supervisor write data, page not present
instruction pointer     = 0x20:0xffffffff805a0345
Fatal double fault
stack pointer           = 0x28:0xffffff800006aba0
rip = 0xffffffff808085ad
frame pointer           = 0x28:0xffffff800006abb0
rsp = 0xffffff80ead87000
code segment            = base 0x0, limit 0xfffff, type 0x1b
rbp = 0xffffff80ead87070
                        = DPL 0, pres 1, long 1, def32 0, gran 1
cpuid = 0; processor eflags = apic id = 00
interrupt enabled, panic: double fault
resume, cpuid = 0
IOPL = 0
KDB: enter: panic
c[thread pid 0 tid 100113 ]
Stopped at      kdb_enter+0x3d: movq    $0,0x69cee0(%rip)
==========================

And another one:

==========================
ad12: FAILURE - WRITE_DMA status=7f error=0 LBA=111033498

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address   = 0x48
fault code              = supervisor write data, page not present
instruction pointer     = 0x20:0xffffffff805a0345
stack pointer           = 0x28:0xffffff800006aba0
frame pointer           = 0x28:0xffffff800006abb0
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 3 (g_up)
[thread pid 3 tid 100011 ]
Stopped at      _mtx_lock_flags+0x15:   lock cmpxchgq   %rsi,0x18(%rdi)
db:0:kdb.enter.default>  capture on
==========================

With your patch, the system doesn't detect that the device is detached and seems to be deadlocked (it doesn't respond to the power button):

==========================
acpi0: suspend request ignored (not ready yet)
acpi0: request to enter state S5 failed (err 6)
acpi0: suspend request ignored (not ready yet)
acpi0: request to enter state S5 failed (err 6)
==========================

So, I can still easily reproduce this problem on 8-STABLE. Your simple patch helps to avoid the page fault but deadlocks the system. Are you sure that you can just return at this point? Perhaps it makes sense to set some error flag before returning?
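To make the question concrete, here is a minimal user-space sketch (not kernel code) of why a bare return in a completion handler could leave a waiter stuck. Everything in it is hypothetical: fake_zio, fake_bio, fake_zio_complete() and simulate_detached() are stand-ins I made up for illustration, and the choice of ENXIO is only an assumption. It just mirrors the shape of the check in vdev_geom_io_intr() from the patch quoted below; the real fix may well need to look different.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the real ZFS/GEOM structures. */
struct fake_zio {
	int   io_error;  /* error reported to whoever waits on the I/O */
	bool  io_done;   /* set once the request has been completed */
	void *io_ctx;    /* per-device context; NULL after a detach */
};

struct fake_bio {
	int              bio_error;
	long             bio_resid;
	struct fake_zio *bio_caller1;
};

/* Completion step that a waiter depends on (hypothetical). */
static void fake_zio_complete(struct fake_zio *zio)
{
	zio->io_done = true;
}

/* Normal path shared by both variants: copy the error, then complete. */
static void finish_normally(struct fake_bio *bp, struct fake_zio *zio)
{
	if ((zio->io_error = bp->bio_error) == 0 && bp->bio_resid != 0)
		zio->io_error = EIO;
	fake_zio_complete(zio);
}

/* Variant A: same shape as the quoted patch, bail out silently. */
static void intr_return_only(struct fake_bio *bp)
{
	struct fake_zio *zio = bp->bio_caller1;

	if (zio->io_ctx == NULL)
		return;  /* request is never completed; a waiter would hang */
	finish_normally(bp, zio);
}

/* Variant B: record an error and still complete the request. */
static void intr_fail_and_complete(struct fake_bio *bp)
{
	struct fake_zio *zio = bp->bio_caller1;

	if (zio->io_ctx == NULL) {
		zio->io_error = ENXIO;  /* assumed error code for "device gone" */
		fake_zio_complete(zio);
		return;
	}
	finish_normally(bp, zio);
}

/* Simulate a completion arriving after the device context was torn down. */
static struct fake_zio simulate_detached(void (*handler)(struct fake_bio *))
{
	struct fake_zio zio = { .io_error = 0, .io_done = false, .io_ctx = NULL };
	struct fake_bio bio = { .bio_error = 0, .bio_resid = 0, .bio_caller1 = &zio };

	assert(handler != NULL);
	handler(&bio);
	return zio;
}
```

Calling simulate_detached(intr_return_only) leaves io_done false with no error recorded, which in this toy model is the "hang" case; simulate_detached(intr_fail_and_complete) returns a request that is marked done and carries an error. Whether the kernel deadlock has the same cause is only my guess.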

Alex Bakhtin

2010/4/23 Andriy Gapon:
>
> Can you try this patch?
>
> --- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
> +++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
> @@ -603,6 +603,9 @@ vdev_geom_io_intr(struct bio *bp)
>         zio = bp->bio_caller1;
>         ctx = zio->io_vd->vdev_tsd;
>
> +       if (ctx == NULL)
> +               return;
> +
>         if ((zio->io_error = bp->bio_error) == 0 && bp->bio_resid != 0)
>                 zio->io_error = EIO;
>
>
> --
> Andriy Gapon
>