From: Henri Hennebert
Date: Tue, 12 Feb 2013 12:22:20 +0100
To: Brandon Gooch
Cc: Konstantin Belousov, freebsd-current@freebsd.org, d@delphij.net
Subject: Re: sysctl -a causes kernel trap 12
Message-ID: <511A25EC.8070000@restart.be>

On 01/19/2013 06:58, Brandon Gooch wrote:
> On Fri, Jan 18, 2013 at 2:56 PM, Xin Li wrote:
>
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA512
>>
>> On 01/18/13 12:50, Brandon Gooch wrote:
>>> On Thu, Jan 10, 2013 at 4:25 PM, Xin Li wrote:
>>>
>>> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA256
>>>
>>> To all: this became harder and harder to replicate lately.  I've
>>> tried these options and the most important progress is that it's
>>> possible to get a crashdump when debug.debugger_on_panic=0 and I
>>> managed to get a backtrace which indicates the panic occurs when
>>> trying to do mtx_lock(&Giant) -> __mtx_lock_sleep -> turnstile_wait
>>> -> propagate_priority, but after I've added some instruments to
>>> the surrounding code and enabled INVARIANTS and/or WITNESS, it
>>> mysteriously went away.
>>>
>>> Reverting my instrumentation code and updating to the latest svn
>>> made the issue disappear for one day.  I've hit it again today but
>>> unfortunately didn't get a successful dump and after reboot I can't
>>> reproduce it again :(
>>>
>>> Still trying...
>>>
>>>
>>> Any updates Xin?
>>
>> No, it mysteriously disappeared for now.  According to my
>> understanding of recent svn commits, I didn't see anybody committing
>> something that fixes it but I can no longer panic my system, with or
>> without debugging code :(
>>
>>> I was actually hitting what I believe to be exactly the same issue
>>> as you on one of my systems, and, as you've seen, adding any extra
>>> debugging or diagnostics seemed to eliminate the issue.
>>>
>>> I was able to generate quite a few vmcores and still have these
>>> sitting around in my filesystem (along with the kernels that helped
>>> produce them).
>>>
>>> I can recreate this crash on my system by compiling the NVIDIA
>>> driver with clang at -O1 and above.  Although it's been noted that
>>> this issue has been seen in scenarios without an NVIDIA driver in
>>> the mix, whatever is happening in the kernel to cause the panic is
>>> somehow triggered by this, at least on my system.
>>
>> I'm not sure if this is the same problem.  Could you please try using
>> gcc to compile the NVIDIA driver and see if that "fixes" the problem?
>>
>> Cheers,
>> - --
>> Xin LI https://www.delphij.net/
>> FreeBSD - The Power to Serve! Live free or die
>>
>
> Indeed, a gcc compiled NVIDIA module eliminates the issue, sorry if I
> hadn't mentioned this earlier.
>
> What was happening to me at first was that my system would just hang while
> booting.  I was able to figure out that it was during /etc/rc.d/initrandom.
> I actually got to a point where I removed the call to sysctl -a from
> 'better_than_nothing()' in /etc/rc.d/initrandom to have a booting system.
> I finally had a situation where I could get a panic by adding SW_WATCHDOG to
> my kernel and running watchdogd(8).
>
> For me, this panic would come and go seemingly at random as well, and I
> couldn't fumble my way around in the debugger to learn much of anything
> when I first started seeing it.  I just started a process of modularizing
> everything I could in my kernel config, then loading modules 1-by-1 and
> booting over-and-over until I finally found what appeared to be the
> problem, which was the NVIDIA module compiled with clang.
>
> Oh, another thing: at times it seemed as though it was the number of
> modules loaded, as I could get the hang with 41 modules loaded, but not 40
> or 42?!  I admit, when I was seeing that behavior, I hadn't eliminated the
> NVIDIA driver from my loaded modules.  I need to revisit the panic situation
> to confirm this particular strangeness.
>
> Here's the last panic I had:
>
> Unread portion of the kernel message buffer:
>                         = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags        = interrupt enabled, resume, IOPL = 0
> current process         = 1175 (sysctl)
>
> (kgdb) bt
> #0  doadump (textdump=1694704112) at pcpu.h:229
> #1  0xffffffff802fab82 in db_fncall (dummy1=<value optimized out>,
>     dummy2=<value optimized out>, dummy3=<value optimized out>,
>     dummy4=<value optimized out>) at /usr/src/sys/ddb/db_command.c:578
> #2  0xffffffff802fa85a in db_command (last_cmdp=<value optimized out>,
>     cmd_table=<value optimized out>, dopager=1) at
>     /usr/src/sys/ddb/db_command.c:449
> #3  0xffffffff802fa612 in db_command_loop () at
>     /usr/src/sys/ddb/db_command.c:502
> #4  0xffffffff802fcf60 in db_trap (type=<value optimized out>, code=0) at
>     /usr/src/sys/ddb/db_main.c:231
> #5  0xffffffff804a7b93 in kdb_trap (type=12, code=0,
>     tf=<value optimized out>) at /usr/src/sys/kern/subr_kdb.c:654
> #6  0xffffffff807157c5 in trap_fatal (frame=0xffffff8865032670,
>     eva=<value optimized out>) at /usr/src/sys/amd64/amd64/trap.c:867
> #7  0xffffffff80715adb in trap_pfault (frame=0x0, usermode=0) at
>     /usr/src/sys/amd64/amd64/trap.c:698
> #8  0xffffffff8071529b in trap (frame=0xffffff8865032670) at
>     /usr/src/sys/amd64/amd64/trap.c:463
> #9  0xffffffff806ff382 in calltrap () at exception.S:228
> #10 0xffffffff8047bd50 in sysctl_sysctl_next_ls (lsp=<value optimized out>,
>     name=0xffffff8865032a80, namelen=<value optimized out>,
>     next=0xffffff8865032898, len=0xffffff8865032904, level=3) at
>     /usr/src/sys/kern/kern_sysctl.c:759
> #11 0xffffffff8047be5e in sysctl_sysctl_next_ls (lsp=0xfffffe000d3f0080,
>     name=0xffffff8865032a7c, namelen=<value optimized out>,
>     next=0xffffff8865032894, len=0xffffff8865032904, level=2) at
>     /usr/src/sys/kern/kern_sysctl.c:786
> #12 0xffffffff8047be5e in sysctl_sysctl_next_ls (lsp=0xfffffe000d3f0080,
>     name=0xffffff8865032a78, namelen=<value optimized out>,
>     next=0xffffff8865032890, len=0xffffff8865032904, level=1) at
>     /usr/src/sys/kern/kern_sysctl.c:786
> #13 0xffffffff8047bca3 in sysctl_sysctl_next (oidp=<value optimized out>,
>     arg1=0xffffff8865032a78, arg2=4, req=0xffffff88650329a8) at
>     /usr/src/sys/kern/kern_sysctl.c:808
> #14 0xffffffff8047b03f in sysctl_root (arg1=<value optimized out>,
>     arg2=<value optimized out>) at /usr/src/sys/kern/kern_sysctl.c:1513
> #15 0xffffffff8047b5d8 in userland_sysctl (td=<value optimized out>,
>     name=0xffffff8865032a70, namelen=<value optimized out>,
>     old=<value optimized out>, oldlenp=<value optimized out>,
>     inkernel=<value optimized out>, new=<value optimized out>,
>     newlen=<value optimized out>, retval=<value optimized out>,
>     flags=1694706064) at /usr/src/sys/kern/kern_sysctl.c:1623
> #16 0xffffffff8047b3c4 in sys___sysctl (td=0xfffffe001e2d4900,
>     uap=0xffffff8865032b80) at /usr/src/sys/kern/kern_sysctl.c:1549
> #17 0xffffffff807160f7 in amd64_syscall (td=0xfffffe001e2d4900, traced=0)
>     at subr_syscall.c:135
> #18 0xffffffff806ff66b in Xfast_syscall () at exception.S:387
> #19 0x000000080093697a in ?? ()
> Previous frame inner to this frame (corrupt stack?)
> Current language:  auto; currently minimal
>
> Any ideas on where to look through this vmcore?
>
> -Brandon

FWIW, just going from 9.1-STABLE r245423M to 9.1-STABLE #0 r246457M
triggers this problem.

I dropped sysctl -a from /etc/rc.d/initrandom and all is back to normal.

I have nvidia-driver-304.64 compiled with gcc, as with all my ports.

Henri