From owner-freebsd-current@freebsd.org Wed Dec 2 11:28:32 2015 Return-Path: Delivered-To: freebsd-current@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id D5637A3E62C for ; Wed, 2 Dec 2015 11:28:32 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [IPv6:2001:470:1f11:75::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 935D2131C; Wed, 2 Dec 2015 11:28:32 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from ralph.baldwin.cx (c-73-231-226-104.hsd1.ca.comcast.net [73.231.226.104]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 91230B93C; Wed, 2 Dec 2015 06:28:31 -0500 (EST) From: John Baldwin To: freebsd-current@freebsd.org Cc: Don Lewis Subject: Re: panic: sbuf_vprintf called with a NULL sbuf pointer Date: Wed, 02 Dec 2015 03:28:07 -0800 Message-ID: <2392832.V4SzgugFk4@ralph.baldwin.cx> User-Agent: KMail/4.14.3 (FreeBSD/10.2-STABLE; KDE/4.14.3; amd64; ; ) In-Reply-To: <201511300016.tAU0GB2c093099@gw.catspoiler.org> References: <201511300016.tAU0GB2c093099@gw.catspoiler.org> MIME-Version: 1.0 Content-Transfer-Encoding: 7Bit Content-Type: text/plain; charset="us-ascii" X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Wed, 02 Dec 2015 06:28:31 -0500 (EST) X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 02 Dec 2015 11:28:32 -0000 On Sunday, November 29, 2015 04:16:11 PM Don Lewis wrote: > I got this strange panic the other day: > > panic: sbuf_vprintf called with a NULL sbuf pointer > > The machine was running this: > > FreeBSD zipper.catspoiler.org 11.0-CURRENT FreeBSD 11.0-CURRENT #31 r290224: Sat Oct 31 02:56:36 PDT 2015 dl@zipper.catspoiler.org:/usr/obj/usr/src/sys/GENERIC amd64 > > > The trigger was: > > swap_pager: indefinite wait buffer: bufobj: 0 blkno: 1737153, > size 4096 > > That triggered an attempt to kill a process and create a core file for > it. > > Where things get strange is at #16 and #17 in the backtrace: > > #13 0xffffffff80a20676 in kassert_panic ( > fmt=0xffffffff8137bf84 "%s called with a NULL sbuf pointer") > at /usr/src/sys/kern/kern_shutdown.c:647 > #14 0xffffffff80a66c51 in _assert_sbuf_integrity ( > fun=0xffffffff8137bdf8 "sbuf_vprintf", s=0x0) > at /usr/src/sys/kern/subr_sbuf.c:103 > #15 0xffffffff80a6742e in sbuf_vprintf (s=0x0, fmt=, > ap=) at /usr/src/sys/kern/subr_sbuf.c:583 > #16 sbuf_printf (s=0x0, fmt=0xffffffff80fd8ad3 "%s") > at /usr/src/sys/kern/subr_sbuf.c:677 > #17 0xffffffff80a237d6 in corefile_open (compress=0, comm=, > uid=, pid=, td=, > vpp=, namep=) > at /usr/src/sys/kern/kern_sig.c:3188 > #18 coredump (td=0x0) at /usr/src/sys/kern/kern_sig.c:3337 > #19 sigexit (td=0x0, sig=40) at /usr/src/sys/kern/kern_sig.c:2976 > #20 0xffffffff80a6fcb8 in userret (td=0xfffff8057b0a59a0, > frame=) at /usr/src/sys/kern/subr_trap.c:147 > #21 0xffffffff80e68619 in syscallret (td=0xfffff8057b0a59a0, > error=, sa=) > at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:187 > #22 amd64_syscall (td=0xfffff8057b0a59a0, traced=0) > at /usr/src/sys/amd64/amd64/trap.c:956 > #23 0xffffffff80e47a3b in Xfast_syscall () > at /usr/src/sys/amd64/amd64/exception.S:394 > #24 0x000000080cc9f12a in ?? () > > In #16, s is 0x0, which is strange because kern_sig.c:3188 is: > > #17 0xffffffff80a237d6 in corefile_open (compress=0, comm=, > uid=, pid=, td=, > vpp=, namep=) > at /usr/src/sys/kern/kern_sig.c:3188 > 3188 sbuf_printf(&sb, "%s", comm); > > I don't see how &sb can be NULL ... Agreed on that. > Also interesting is: > > (kgdb) print sb > $1 = > > I don't see how sb can be optimized out since its address is passed to > sbuf_printf(). This is just what gdb says when it doesn't think it can find the value becuase the regsiter holding it (according to the debug info) has been overwritten. Sometimes if you look in the disassembly you can find another copy of the value in another register and figure out the value thay way. If you use kgdb710 from the devel/gdb port it sometimes does a better job here, but even with it I often run into this. > While format is optimized out, the corefilename isn't and > format=corefilename > > (kgdb) print format > $2 = > (kgdb) print corefilename > $3 = "%N.core", '\000' > > Since this is the first iteration of the loop, and the crash happens on > the entry into sbuf_printf()->sbuf_vprintf(), it doesn't look like it > could be caused by stack smash. > > Under other circumstances I've seen other processes dump core, and I can > trigger core dumps by sending SIGQUIT to processes ... > > %sleep 10 & > [1] 2456 > %kill -QUIT %1 > % > [1] Quit sleep 10 (core dumped) > > Another other oddity is td=0x0 in frames 18 and 19. I'm guessing that > kgdb is wrong about this because that should cause a NULL pointer > dereference panic in sigexit(). Probably the saved register on the stack that gdb thinks td is saved in has been overwritten (so this could be a buffer overrun onto the stack after all). > Also sig=40 in sigexit() is strange since the highest numbered signal is > 33. > > Lastly, I don't see how we get from userret() to sigexit(). Line 147 in > subr_trap.c (userret()) is: > WITNESS_WARN(WARN_PANIC, NULL, "userret: returning"); I suspect there is some tail-call optimization or some such so that the function userret() called branched go sigexit() instead of calling it (since it is marked __dead2, meaning it never returns). kgdb710 might cope with this better (it constructs "fake" frames for inlined functions for example). If you want to look at this further, try going to frame 16 and dissassembling the instructions before the call to see if you can spot which register the first parameter (saved in %rdi IIRC) comes from. -- John Baldwin