Date: Mon, 3 May 2010 23:30:03 GMT
From: Alex Bakhtin <alex.bakhtin@gmail.com>
To: freebsd-fs@FreeBSD.org
Subject: Re: kern/145339: [zfs] deadlock after detaching block device from raidz pool
Message-ID: <201005032330.o43NU3GM020629@freefall.freebsd.org>
The following reply was made to PR kern/145339; it has been noted by GNATS.

From: Alex Bakhtin <alex.bakhtin@gmail.com>
To: Andriy Gapon <avg@icyb.net.ua>
Cc: bug-followup@freebsd.org, Pawel Jakub Dawidek <pjd@freebsd.org>
Subject: Re: kern/145339: [zfs] deadlock after detaching block device from raidz pool
Date: Tue, 4 May 2010 03:23:35 +0400

Andriy,

Upgraded to today's stable. Reproduced the problem. On GENERIC the system
just hangs with the following output:

==========================
ad12: FAILURE - WRITE_DMA48 status=7f<READY,DMA_READY,DSC,DRQ,CORRECTABLE,INDEX,ERROR> error=0 LBA=2312588250

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address   = 0x48
fault code              = supervisor write data, page not present
instruction pointer     = 0x20:0xffffffff80593e95
stack pointer           = 0x28:0xffffff8000065ba0
frame pointer           = 0x28:0xffffff8000065bb0
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 3 (g_up)
trap number             = 12
panic: page fault
cpuid = 1

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x0
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80545a28
stack pointer           = 0x28:0xffffff80eada2a40
frame pointer           = 0x28:0xffffff80eada2a90
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 0 (spa_zio)
trap number             = 12
==========================

With GENERIC + DDB/KDB enabled I got the following (it seems that the first
time I detached the device there was no active transaction; I can try to
reproduce):

==========================
ad12: FAILURE - device detached

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address   = 0x48
fault code              = supervisor write data, page not present
instruction pointer     = 0x20:0xffffffff805a0345
Fatal double fault
stack pointer           = 0x28:0xffffff800006aba0
rip = 0xffffffff808085ad
frame pointer           = 0x28:0xffffff800006abb0
rsp = 0xffffff80ead87000
code segment            = base 0x0, limit 0xfffff, type 0x1b
rbp = 0xffffff80ead87070
                        = DPL 0, pres 1, long 1, def32 0, gran 1
cpuid = 0; processor eflags = apic id = 00
interrupt enabled, panic: double fault
resume, cpuid = 0
IOPL = 0
KDB: enter: panic
c[thread pid 0 tid 100113 ]
Stopped at      kdb_enter+0x3d: movq    $0,0x69cee0(%rip)
==========================

And another one:

==========================
ad12: FAILURE - WRITE_DMA status=7f<READY,DMA_READY,DSC,DRQ,CORRECTABLE,INDEX,ERROR> error=0 LBA=111033498


Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address   = 0x48
fault code              = supervisor write data, page not present
instruction pointer     = 0x20:0xffffffff805a0345
stack pointer           = 0x28:0xffffff800006aba0
frame pointer           = 0x28:0xffffff800006abb0
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 3 (g_up)
[thread pid 3 tid 100011 ]
Stopped at      _mtx_lock_flags+0x15:   lock cmpxchgq   %rsi,0x18(%rdi)
db:0:kdb.enter.default>  capture on
==========================

And with your patch the system doesn't detect that the device is detached
and seems to be deadlocked (it doesn't respond to the power button):

==========================
acpi0: suspend request ignored (not ready yet)
acpi0: request to enter state S5 failed (err 6)
acpi0: suspend request ignored (not ready yet)
acpi0: request to enter state S5 failed (err 6)
==========================

So, I can still easily reproduce this problem on 8-STABLE. Your simple
patch helps to avoid the page fault but deadlocks the system. Are you sure
that you can just return at this point? Would it make sense to set some
error flag before returning?

Alex Bakhtin

2010/4/23 Andriy Gapon <avg@icyb.net.ua>:
>
> Can you try this patch?
>
> --- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
> +++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
> @@ -603,6 +603,9 @@ vdev_geom_io_intr(struct bio *bp)
>         zio = bp->bio_caller1;
>         ctx = zio->io_vd->vdev_tsd;
>
> +       if (ctx == NULL)
> +               return;
> +
>         if ((zio->io_error = bp->bio_error) == 0 && bp->bio_resid != 0)
>                 zio->io_error = EIO;
>
>
> --
> Andriy Gapon
>
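
The question about returning early can be illustrated with a small user-space model. This is only a sketch and not kernel code: `model_zio`, `model_bio`, `io_done`, and all three helper functions are hypothetical stand-ins, and the choice of `ENXIO` as the error for a detached device is an assumption. It shows why a bare `return` could leave whoever issued the I/O waiting forever, while recording an error and signalling completion would let it proceed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's zio and bio; not the real structures. */
struct model_zio {
    int   io_error;  /* error reported back to the issuer */
    int   io_done;   /* set once completion has been signalled */
    void *vdev_tsd;  /* NULL models a detached device */
};

struct model_bio {
    int               bio_error;
    long              bio_resid;
    struct model_zio *bio_caller1;
};

/* Shared tail: copy the error, flag short transfers as EIO, signal completion. */
static void complete_normally(struct model_bio *bp, struct model_zio *zio)
{
    if ((zio->io_error = bp->bio_error) == 0 && bp->bio_resid != 0)
        zio->io_error = EIO;
    zio->io_done = 1;
}

/* Models the patched handler: on a detached device it returns without signalling. */
static void intr_return_only(struct model_bio *bp)
{
    assert(bp != NULL && bp->bio_caller1 != NULL);
    struct model_zio *zio = bp->bio_caller1;

    if (zio->vdev_tsd == NULL)
        return;                 /* issuer is never told the I/O finished */
    complete_normally(bp, zio);
}

/* Models the suggested variant: record an error, then signal completion. */
static void intr_set_error(struct model_bio *bp)
{
    assert(bp != NULL && bp->bio_caller1 != NULL);
    struct model_zio *zio = bp->bio_caller1;

    if (zio->vdev_tsd == NULL) {
        zio->io_error = ENXIO;  /* assumption: ENXIO for a missing device */
        zio->io_done = 1;
        return;
    }
    complete_normally(bp, zio);
}

/* A waiter that sleeps until io_done is set would block while this returns 1. */
static int waiter_would_block(const struct model_zio *zio)
{
    return !zio->io_done;
}
```

In this model the first handler leaves `io_done` at 0 for a detached device, which matches the observed hang, while the second one completes the request with an error. Whether the real handler can safely complete the zio at that point is exactly the open question in the message above.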