Date: Tue, 25 Jul 2017 00:39:09 +0700 From: Eugene Grosbein <eugen@grosbein.net> To: FreeBSD Stable <freebsd-stable@FreeBSD.org> Subject: Re: stable/11: Kernel page fault with the following non-sleepable locks held: CAM device lock Message-ID: <597630BD.8010605@grosbein.net> In-Reply-To: <59749E58.2060408@grosbein.net> References: <59749E58.2060408@grosbein.net>
next in thread | previous in thread | raw e-mail | index | archive | help
On 23.07.2017 20:02, Eugene Grosbein wrote: > Hi! > > Long story short: stable/11 r321371 started to panic at the moment of smartd invocation > after my SSD died. > > I have Intel motherboard with graid-supported pseudo-raid. > I use it in RAID1 mode with one HDD and one SSD. > > Yesterday the SSD has died: it is not detected by BIOS nor FreeBSD kernel (timeouts). > This went unnoticed by me as graid just disconnected it on-the-fly: > > kernel: ahcich5: Timeout on slot 24 port 0 > kernel: ahcich5: is 00000000 cs 00000000 ss 01000000 rs 01000000 tfd 40 serr 00000000 cmd 0000d817 > kernel: (ada1:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 02 ad 12 9e 40 3b 00 00 00 00 00 > kernel: (ada1:ahcich5:0:0:0): CAM status: Command timeout > kernel: (ada1:ahcich5:0:0:0): Retrying command > kernel: ahcich5: AHCI reset: device not ready after 31000ms (tfd = 00000080) > [skip] > kernel: ada1 at ahcich5 bus 0 scbus2 target 0 lun 0 > kernel: ada1: <LITEON CV1-MB512-11 2803> s/n JYKJ550855860139 detached > [skip] > kernel: (ada1:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 02 ad 12 9e 40 3b 00 00 00 00 00 > kernel: (ada1:ahcich5:0:0:0): CAM status: Command timeout > kernel: (ada1:ahcich5:0:0:0): Error 5, Periph was invalidated > kernel: GEOM_RAID: Write failed: failing subdisk. ada1[WRITE(offset=269389066240, length=32768)] > kernel: GEOM_RAID: Intel-c291fe96: Disk ada1 state changed from ACTIVE to FAILED. > kernel: GEOM_RAID: Intel-c291fe96: Subdisk r0:1-ada1 state changed from ACTIVE to FAILED. > kernel: GEOM_RAID: Intel-c291fe96: Volume r0 state changed from OPTIMAL to DEGRADED. > kernel: GEOM_RAID: Intel-c291fe96: Disk ada1 state changed from FAILED to OFFLINE. > kernel: GEOM_RAID: Intel-c291fe96: Subdisk r0:1-[unknown] state changed from FAILED to NONE. > kernel: GEOM_RAID: Write failed: failing subdisk. ada1[WRITE(offset=270699851776, length=32768)] > kernel: GEOM_RAID: Intel-c291fe96: Warning! Fail request to a disk in a wrong state (OFFLINE)! > > Unaware of that, I've performed standard source upgrade from 11.1-PRERELEASE r318692 > to stable/11 r321371 that went smooth. After reboot, BIOS was unable to detect SSD, > reported degraded state of the mirror and booted FreeBSD using second mirror component (HDD). > > After long timeout, the kernel could not detect dead SSD too and continued to run with degraded mirror > just fine: the system went multiuser mode and almost finished loading when rcNG started smartd. > The kernel panices that moment. This is repeatable: I can cold-boot to single user mode, start smartd > and get same panic. This is debugging kernel and I managed to obtain crashdump. > > kgdb session follows: > > <118>Starting smartd. > Kernel page fault with the following non-sleepable locks held: > exclusive sleep mutex CAM device lock (CAM device lock) r = 0 (0xfffff8000cf71c60) locked @ /home/src/sys/cam/scsi/scsi_pass.c:1766 > stack backtrace: > #0 0xffffffff80a12620 at witness_debugger+0x70 > #1 0xffffffff80a13a4e at witness_warn+0x45e > #2 0xffffffff80e4b693 at trap_pfault+0x53 > #3 0xffffffff80e4ae3e at trap+0x29e > #4 0xffffffff80e2ed91 at calltrap+0x8 > #5 0xffffffff8033873a at passsendccb+0x6a > #6 0xffffffff80337836 at passdoioctl+0x3c6 > #7 0xffffffff80337052 at passioctl+0x22 > #8 0xffffffff80878c78 at devfs_ioctl_f+0x138 > #9 0xffffffff80a18184 at kern_ioctl+0x2c4 > #10 0xffffffff80a17e4f at sys_ioctl+0x16f > #11 0xffffffff80e4c05a at amd64_syscall+0x53a > #12 0xffffffff80e2f07b at Xfast_syscall+0xfb > > Fatal trap 12: page fault while in kernel mode > cpuid = 0; apic id = 00 > fault virtual address = 0xa > fault code = supervisor read data, page not present > instruction pointer = 0x20:0xffffffff80e494e1 > stack pointer = 0x28:0xfffffe04675ff670 > frame pointer = 0x28:0xfffffe04675ff670 > code segment = base 0x0, limit 0xfffff, type 0x1b > = DPL 0, pres 1, long 1, def32 0, gran 1 > processor eflags = interrupt enabled, resume, IOPL = 0 > current process = 1387 (smartd) > trap number = 12 > panic: page fault > cpuid = 0 > KDB: stack backtrace: > db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe04675ff250 > vpanic() at vpanic+0x186/frame 0xfffffe04675ff2d0 > panic() at panic+0x43/frame 0xfffffe04675ff330 > trap_fatal() at trap_fatal+0x322/frame 0xfffffe04675ff380 > trap_pfault() at trap_pfault+0x62/frame 0xfffffe04675ff3e0 > trap() at trap+0x29e/frame 0xfffffe04675ff5a0 > calltrap() at calltrap+0x8/frame 0xfffffe04675ff5a0 > --- trap 0xc, rip = 0xffffffff80e494e1, rsp = 0xfffffe04675ff670, rbp = 0xfffffe04675ff670 --- > copyin() at copyin+0x41/frame 0xfffffe04675ff670 > passsendccb() at passsendccb+0x6a/frame 0xfffffe04675ff6f0 > passdoioctl() at passdoioctl+0x3c6/frame 0xfffffe04675ff7a0 > passioctl() at passioctl+0x22/frame 0xfffffe04675ff7e0 > devfs_ioctl_f() at devfs_ioctl_f+0x138/frame 0xfffffe04675ff840 > kern_ioctl() at kern_ioctl+0x2c4/frame 0xfffffe04675ff8a0 > sys_ioctl() at sys_ioctl+0x16f/frame 0xfffffe04675ff980 > amd64_syscall() at amd64_syscall+0x53a/frame 0xfffffe04675ffab0 > Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe04675ffab0 > --- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x80174c18a, rsp = 0x7fffffff0308, rbp = 0x7fffffff0900 --- > Uptime: 2m46s > Dumping 723 out of 16285 MB:..3%..12%..23%..31%..43%..51%..62%..71%..82%..91% > > (kgdb) bt > #0 doadump (textdump=1) at /home/src/sys/kern/kern_shutdown.c:299 > #1 0xffffffff809b1ab1 in kern_reboot (howto=260) at /home/src/sys/kern/kern_shutdown.c:376 > #2 0xffffffff809b2100 in vpanic (fmt=<value optimized out>, ap=<value optimized out>) at /home/src/sys/kern/kern_shutdown.c:770 > #3 0xffffffff809b2143 in panic (fmt=<value optimized out>) at /home/src/sys/kern/kern_shutdown.c:701 > #4 0xffffffff80e4b632 in trap_fatal (frame=0xfffffe04675ff5b0, eva=10) at /home/src/sys/amd64/amd64/trap.c:801 > #5 0xffffffff80e4b6a2 in trap_pfault (frame=0xfffffe04675ff5b0, usermode=0) at pcpu.h:222 > #6 0xffffffff80e4ae3e in trap (frame=0xfffffe04675ff5b0) at /home/src/sys/amd64/amd64/trap.c:421 > #7 0xffffffff80e2ed91 in calltrap () at /home/src/sys/amd64/amd64/exception.S:236 > #8 0xffffffff80e494e1 in copyin () at /home/src/sys/amd64/amd64/support.S:304 > #9 0xffffffff8033873a in passsendccb (periph=0xfffff8000cfe5e00, ccb=0xfffff800662f9000, inccb=0xfffff80066d91000) at /home/src/sys/cam/scsi/scsi_pass.c:2172 > #10 0xffffffff80337836 in passdoioctl (dev=<value optimized out>, cmd=<value optimized out>, addr=0xfffff80066d91000 "", flag=<value optimized out>, > td=<value optimized out>) at /home/src/sys/cam/scsi/scsi_pass.c:1823 > #11 0xffffffff80337052 in passioctl (dev=0xfffff8000cfc4c00, cmd=3303020802, addr=0xfffff80066d91000 "", flag=3, td=0xfffff8000e1a3000) > at /home/src/sys/cam/scsi/scsi_pass.c:1751 > #12 0xffffffff80878c78 in devfs_ioctl_f (fp=0xfffff8000e50d3c0, com=3303020802, data=0xfffff80066d91000, cred=0xfffff800066c7200, td=0xfffff8000e1a3000) > at /home/src/sys/fs/devfs/devfs_vnops.c:791 > #13 0xffffffff80a18184 in kern_ioctl (td=<value optimized out>, fd=<value optimized out>, com=<value optimized out>, data=<value optimized out>) at file.h:323 > #14 0xffffffff80a17e4f in sys_ioctl (td=<value optimized out>, uap=0xfffff8000e1a3538) at /home/src/sys/kern/sys_generic.c:745 > #15 0xffffffff80e4c05a in amd64_syscall (td=0xfffff8000e1a3000, traced=0) at subr_syscall.c:131 > #16 0xffffffff80e2f07b in Xfast_syscall () at /home/src/sys/amd64/amd64/exception.S:396 > #17 0x000000080174c18a in ?? () > Previous frame inner to this frame (corrupt stack?) > > I can keep dead SSD in place to debug this. Looking for help. I managed to obtain crashdump for the case and filled PR with details: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=220972
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?597630BD.8010605>