From owner-freebsd-stable@FreeBSD.ORG Tue Dec 16 23:27:03 2014 Return-Path: Delivered-To: freebsd-stable@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 07E54695; Tue, 16 Dec 2014 23:27:03 +0000 (UTC) Received: from ipmail05.adl6.internode.on.net (ipmail05.adl6.internode.on.net [150.101.137.143]) by mx1.freebsd.org (Postfix) with ESMTP id 5D290226; Tue, 16 Dec 2014 23:27:00 +0000 (UTC) Received: from ppp14-2-30-215.lns21.adl2.internode.on.net (HELO leader.local) ([14.2.30.215]) by ipmail05.adl6.internode.on.net with ESMTP; 17 Dec 2014 09:55:11 +1030 Message-ID: <5490BF55.3020108@ShaneWare.Biz> Date: Wed, 17 Dec 2014 09:55:09 +1030 From: Shane Ambler User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:31.0) Gecko/20100101 Thunderbird/31.3.0 MIME-Version: 1.0 To: John Baldwin , freebsd-stable@freebsd.org Subject: Re: Help debugging stable/10 References: <5488F58D.7060708@ShaneWare.Biz> <201412161129.57704.jhb@freebsd.org> In-Reply-To: <201412161129.57704.jhb@freebsd.org> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Cc: Alexander Motin , hselasky@freebsd.org, mjg@freebsd.org, Shane Ambler X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 16 Dec 2014 23:27:03 -0000 On 17/12/2014 02:59, John Baldwin wrote: > On Wednesday, December 10, 2014 8:38:21 pm Shane Ambler wrote: >> Since upgrading to 10.1 (RC2) I have had trouble getting uptimes greater >> than 1 day. I have little experience debugging the OS so could use some >> help. >> >> # uname -a >> FreeBSD leader.local 10.1-STABLE FreeBSD 10.1-STABLE #0 r275364: Tue Dec >> 2 08:13:06 ACDT 2014 >> root@leader.local:/usr/obj/usr/src-stable/sys/GENERIC amd64 >> >> This is on an ASUS P8H61-M LE/USB3 corei5 8GB with 3x 2TB Seagate drives >> in raidz. >> Full backtraces and dmesg at http://shaneware.biz/freebsddebugdata/ >> >> The thing that breaks which forces me to reset the machine is that I am >> unable to start new processes. Existing processes continue to work I >> just can't start new ones. Some simple commands do work but top ps >> procstat usbconfig all fail to start. I have been able use script to >> get some backtraces from kgdb before restarting. > > It looks like your processes are hanging on a global lock used by sysctl(3). > That would explain hangs in top/ps/procstat as they all use sysctls. For > example: > > Thread 709 (Thread 101172): > #0 sched_switch (td=0xfffff8006cccd490, newtd=, flags=) > at /usr/src-stable/sys/kern/sched_ule.c:1945 > #1 0xffffffff80933801 in mi_switch (flags=260, newtd=0x0) at /usr/src-stable/sys/kern/kern_synch.c:493 > #2 0xffffffff8097110a in sleepq_wait (wchan=0x0, pri=0) at /usr/src-stable/sys/kern/subr_sleepqueue.c:617 > #3 0xffffffff8093284a in _sx_xlock_hard (sx=0xffffffff8157adc0, tid=18446735279441892496, > opts=, file=0x0, line=0) at /usr/src-stable/sys/kern/kern_sx.c:681 > #4 0xffffffff80936004 in userland_sysctl (td=, name=0xfffffe021e889930, namelen=3, > old=, oldlenp=, inkernel=, > new=, newlen=, retval=, flags=0) at sx.h:152 > #5 0xffffffff80935e64 in sys___sysctl (td=0xfffff8006cccd490, uap=0xfffffe021e889a40) > at /usr/src-stable/sys/kern/kern_sysctl.c:1536 > #6 0xffffffff80d29c51 in amd64_syscall (td=0xfffff8006cccd490, traced=0) at subr_syscall.c:134 > #7 0xffffffff80d0eceb in Xfast_syscall () at /usr/src-stable/sys/amd64/amd64/exception.S:391 > #8 0x0000000800b6e64a in ?? () > Previous frame inner to this frame (corrupt stack?) > ---Type to continue, or q to quit--- > > To hang like this means that some sysctl handler is running for a long time. > > Looking in your traces, this appears to be the problematic sysctl: > > Thread 397 (Thread 100422): > #0 sched_switch (td=0xfffff8001049f920, newtd=, flags=) > at /usr/src-stable/sys/kern/sched_ule.c:1945 > #1 0xffffffff80933801 in mi_switch (flags=260, newtd=0x0) at /usr/src-stable/sys/kern/kern_synch.c:493 > #2 0xffffffff8097189a in sleepq_timedwait (wchan=0x0, pri=0) at /usr/src-stable/sys/kern/subr_sleepqueue.c:652 > #3 0xffffffff8093320e in _sleep (ident=, lock=, > priority=, wmesg=, sbt=, pr=, > flags=) at /usr/src-stable/sys/kern/kern_synch.c:251 > #4 0xffffffff80891ae3 in g_waitfor_event (func=, arg=, > flag=) at /usr/src-stable/sys/geom/geom_event.c:427 > #5 0xffffffff808933d9 in sysctl_kern_geom_conftxt (oidp=, arg1=, > arg2=, req=0xfffffe021dc13868) at /usr/src-stable/sys/geom/geom_kern.c:169 > #6 0xffffffff80935a9e in sysctl_root (arg1=, arg2=) > at /usr/src-stable/sys/kern/kern_sysctl.c:1500 > #7 0xffffffff80936078 in userland_sysctl (td=, name=0xfffffe021dc13930, > namelen=, old=, oldlenp=, > inkernel=, new=, newlen=, > retval=, flags=0) at /usr/src-stable/sys/kern/kern_sysctl.c:1610 > #8 0xffffffff80935e64 in sys___sysctl (td=0xfffff8001049f920, uap=0xfffffe021dc13a40) > at /usr/src-stable/sys/kern/kern_sysctl.c:1536 > #9 0xffffffff80d29c51 in amd64_syscall (td=0xfffff8001049f920, traced=0) at subr_syscall.c:134 > #10 0xffffffff80d0eceb in Xfast_syscall () at /usr/src-stable/sys/amd64/amd64/exception.S:391 > #11 0x00000008019c864a in ?? () > Previous frame inner to this frame (corrupt stack?) > > (Note that HEAD has some changes by mjg@ (cc'd) to allow concurrent sysctls > that would at least make this less painful.) > > The root issue is why is geom hanging. Hmm, that thread is also stuck on an > sx lock: > > Thread 259 (Thread 100015): > #0 sched_switch (td=0xfffff80005170000, newtd=, flags=) > at /usr/src-stable/sys/kern/sched_ule.c:1945 > #1 0xffffffff80933801 in mi_switch (flags=260, newtd=0x0) at /usr/src-stable/sys/kern/kern_synch.c:493 > #2 0xffffffff8097110a in sleepq_wait (wchan=0x0, pri=0) at /usr/src-stable/sys/kern/subr_sleepqueue.c:617 > #3 0xffffffff8093284a in _sx_xlock_hard (sx=0xffffffff816451e0, tid=18446735277701922816, > opts=, file=0x0, line=0) at /usr/src-stable/sys/kern/kern_sx.c:681 > #4 0xffffffff808912b2 in g_run_events () at sx.h:152 > #5 0xffffffff808fad3a in fork_exit (callout=0xffffffff80893240 , arg=0x0, frame=0xfffffe01ef98aac0) > at /usr/src-stable/sys/kern/kern_fork.c:996 > #6 0xffffffff80d0ef3e in fork_trampoline () at /usr/src-stable/sys/amd64/amd64/exception.S:606 > #7 0x0000000000000000 in ?? () > > That is probably g_topology_lock. If so, it is held by this thread > (g_dev_open() grabs it around g_access()): > > Thread 673 (Thread 101936): > #0 sched_switch (td=0xfffff80162200490, newtd=, flags=) > at /usr/src-stable/sys/kern/sched_ule.c:1945 > #1 0xffffffff80933801 in mi_switch (flags=260, newtd=0x0) at /usr/src-stable/sys/kern/kern_synch.c:493 > #2 0xffffffff8097110a in sleepq_wait (wchan=0x0, pri=0) at /usr/src-stable/sys/kern/subr_sleepqueue.c:617 > #3 0xffffffff80933227 in _sleep (ident=, lock=, > priority=, wmesg=, sbt=, pr=, > flags=) at /usr/src-stable/sys/kern/kern_synch.c:255 > #4 0xffffffff8030b9b3 in daopen (dp=) at /usr/src-stable/sys/cam/scsi/scsi_da.c:1295 > #5 0xffffffff8089013b in g_disk_access (pp=0xfffff800097f4c00, r=, w=, > e=) at /usr/src-stable/sys/geom/geom_disk.c:136 > #6 0xffffffff80894f3e in g_access (cp=0xfffff800097f2e00, dcr=1, dcw=0, dce=0) > at /usr/src-stable/sys/geom/geom_subr.c:913 > #7 0xffffffff8088df12 in g_dev_open (dev=, flags=, > fmt=, td=) at /usr/src-stable/sys/geom/geom_dev.c:361 > #8 0xffffffff808180f2 in devfs_open (ap=) at /usr/src-stable/sys/fs/devfs/devfs_vnops.c:1086 > ---Type to continue, or q to quit--- > #9 0xffffffff80e44fd1 in VOP_OPEN_APV (vop=, a=) at vnode_if.c:469 > #10 0xffffffff809d9bb4 in vn_open_vnode (vp=0xfffff8017d14d000, fmode=1, cred=0xfffff8000c2e4b00, > td=0xfffff80162200490, fp=0xfffff80081060230) at vnode_if.h:196 > #11 0xffffffff809d97ac in vn_open_cred (ndp=0xfffffe021e97b880, flagp=0xfffffe021e97b95c, > cmode=, vn_open_flags=, cred=0x0, fp=0xfffff80081060230) > at /usr/src-stable/sys/kern/vfs_vnops.c:256 > #12 0xffffffff809d2def in kern_openat (td=0xfffff80162200490, fd=-100, > path=0x7fffffffe99e , pathseg=UIO_USERSPACE, flags=1, > mode=) at /usr/src-stable/sys/kern/vfs_syscalls.c:1091 > #13 0xffffffff80d29c51 in amd64_syscall (td=0xfffff80162200490, traced=0) at subr_syscall.c:134 > #14 0xffffffff80d0eceb in Xfast_syscall () at /usr/src-stable/sys/amd64/amd64/exception.S:391 > #15 0x0000000800f8dcea in ?? () > Previous frame inner to this frame (corrupt stack?) > > Line 1295 of scsi_da.c in a stable/10 checkout of today is this: > > /* Wait for the disk size update. */ > error = cam_periph_sleep(periph, &softc->disk->d_mediasize, PRIBIO, > "dareprobe", 0); > > To be stuck here means that the request dareprobe() queued is stuck behind > some other request on the device. Looking in your traces, there are several > threads all waiting for an ioctl() to a passX device to complete: > > Thread 432 (Thread 100542): > #0 sched_switch (td=0xfffff801acb8a920, newtd=, flags=) > at /usr/src-stable/sys/kern/sched_ule.c:1945 > #1 0xffffffff80933801 in mi_switch (flags=260, newtd=0x0) at /usr/src-stable/sys/kern/kern_synch.c:493 > #2 0xffffffff8097110a in sleepq_wait (wchan=0x0, pri=0) at /usr/src-stable/sys/kern/subr_sleepqueue.c:617 > #3 0xffffffff80933227 in _sleep (ident=, lock=, > priority=, wmesg=, sbt=, pr=, > flags=) at /usr/src-stable/sys/kern/kern_synch.c:255 > ---Type to continue, or q to quit--- > #4 0xffffffff802ddc0e in cam_periph_runccb (ccb=0xfffff80187888000, error_routine=0xffffffff8030d2a0 , > camflags=CAM_RETRY_SELTO, sense_flags=18, ds=0xfffff8000984fa20) at /usr/src-stable/sys/cam/cam_periph.c:969 > #5 0xffffffff8030d233 in passdoioctl (dev=, cmd=, > addr=0xfffff800b50f4800 "", flag=, td=) > at /usr/src-stable/sys/cam/scsi/scsi_pass.c:680 > #6 0xffffffff8030cf32 in passioctl (dev=0xfffff80009835e00, cmd=3302496258, addr=0xfffff800b50f4800 "", flag=3, > td=0xfffff801acb8a920) at /usr/src-stable/sys/cam/scsi/scsi_pass.c:533 > #7 0xffffffff80819a99 in devfs_ioctl_f (fp=0xfffff800b5d6f320, com=3302496258, data=0xfffff800b50f4800, > cred=, td=0xfffff801acb8a920) at /usr/src-stable/sys/fs/devfs/devfs_vnops.c:759 > #8 0xffffffff8097d865 in kern_ioctl (td=0xfffff801acb8a920, fd=, com=0) at file.h:320 > #9 0xffffffff8097d560 in sys_ioctl (td=0xfffff801acb8a920, uap=0xfffffe021de9aa40) > at /usr/src-stable/sys/kern/sys_generic.c:718 > #10 0xffffffff80d29c51 in amd64_syscall (td=0xfffff801acb8a920, traced=0) at subr_syscall.c:134 > #11 0xffffffff80d0eceb in Xfast_syscall () at /usr/src-stable/sys/amd64/amd64/exception.S:391 > #12 0x0000000800ff3f7a in ?? () > Previous frame inner to this frame (corrupt stack?) > > I suspect the daopen() is hanging on the disk associated with the given > passX device. > > Are you running smartd or anything else that accesses passX devices? > > Also, do you have a coredump from this or were you just running kgdb against > the live system? If you have a coredump you can run ps against it (or use > some kgdb macros I have) to see the command name of the process doing the > passX ioctls. We could also see which passX devices were blocking to see if > this is related to either AHCI or USB. Also, do you possibly have any errors > in your dmesg from the disk devices? > Thanks for looking at this. I updated a few days ago - FreeBSD leader.local 10.1-STABLE FreeBSD 10.1-STABLE #0 r275731: Sat Dec 13 08:36:29 ACDT 2014 root@leader.local:/usr/obj/usr/src/sys/GENERIC amd64 I haven't got any core dumps as I only get to the stage of no process creation (yesterday I had it freeze completely) leaving no choice but to reset. I leave a root terminal open so that I can run kgdb before resetting. I have sysutils/smartmontools installed but I'm not running smartd, I just have the smart status in daily logs. I'll vote for USB - I have two USB memsticks that I use almost daily - previously (since 10.1RC2) I have had trouble with them. Failure to get the dev entries created, followed with no console messages of device insert/remove. I had noticed this before being unable to create processes so wasn't sure which came first. usbconfig was sometimes the first process I noticed failure to run. I haven't seen this for a while, so not in stable, maybe in release but definitely in RC 2-4 I think it was in RC3 I got errors of - xptioctl: pass driver is not in the kernel xptioctl: put "device pass" in your kernel config file a couple of times with the USB memstick - again not recently. Before release I had an issue with constant zfs writes - I haven't been able to re-create a lockup since RC4 so don't know if that is related. https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=194654 I'm not seeing any device errors in dmesg, log/messages or log/console.log Someone had mentioned lockups with latest nvidia driver that going back to the previous had fixed, I am getting some x related crashes so was going to look at that. Dec 17 03:21:16 leader kernel: pid 62431 (xcmmtst_big), uid 0: exited on signal 10 (core dumped) Dec 17 03:22:11 leader kernel: pid 69694 (xprobe_gas_x8632), uid 0: exited on signal 11 (core dumped) Dec 17 03:22:11 leader kernel: pid 69779 (xprobe_3dnow), uid 0: exited on signal 4 (core dumped) Dec 17 03:22:42 leader kernel: pid 78410 (xprobe_gas_x8632), uid 0: exited on signal 11 (core dumped) Dec 17 03:22:43 leader kernel: pid 78554 (xprobe_3dnow), uid 0: exited on signal 4 (core dumped) Dec 17 03:46:21 leader kernel: pid 40770 (xdfc), uid 0: exited on signal 10 (core dumped) Dec 17 03:46:26 leader kernel: pid 41000 (xdmmtst), uid 0: exited on signal 10 (core dumped) Dec 17 03:50:55 leader kernel: pid 52533 (xdfc), uid 0: exited on signal 10 (core dumped) Dec 17 03:51:30 leader kernel: pid 54062 (xdfc), uid 0: exited on signal 11 (core dumped) I could run a build of kernel and world with debugging or (is it witness?) for a few days if that would help get more info. -- FreeBSD - the place to B...Software Developing Shane Ambler