From owner-freebsd-current@freebsd.org Wed Dec 2 21:26:05 2015 Return-Path: Delivered-To: freebsd-current@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id E9833A3F149 for ; Wed, 2 Dec 2015 21:26:04 +0000 (UTC) (envelope-from truckman@FreeBSD.org) Received: from gw.catspoiler.org (unknown [IPv6:2602:304:b010:ef20::f2]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "gw.catspoiler.org", Issuer "gw.catspoiler.org" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id BDD6C14A6; Wed, 2 Dec 2015 21:26:04 +0000 (UTC) (envelope-from truckman@FreeBSD.org) Received: from FreeBSD.org (mousie.catspoiler.org [192.168.101.2]) by gw.catspoiler.org (8.15.2/8.15.2) with ESMTP id tB2LPuHA021478; Wed, 2 Dec 2015 13:26:00 -0800 (PST) (envelope-from truckman@FreeBSD.org) Message-Id: <201512022126.tB2LPuHA021478@gw.catspoiler.org> Date: Wed, 2 Dec 2015 13:25:56 -0800 (PST) From: Don Lewis Subject: Re: panic: sbuf_vprintf called with a NULL sbuf pointer To: jhb@freebsd.org cc: freebsd-current@freebsd.org In-Reply-To: <2392832.V4SzgugFk4@ralph.baldwin.cx> MIME-Version: 1.0 Content-Type: TEXT/plain; charset=us-ascii X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 02 Dec 2015 21:26:05 -0000 On 2 Dec, John Baldwin wrote: > On Sunday, November 29, 2015 04:16:11 PM Don Lewis wrote: >> I got this strange panic the other day: >> >> panic: sbuf_vprintf called with a NULL sbuf pointer >> >> The machine was running this: >> >> FreeBSD zipper.catspoiler.org 11.0-CURRENT FreeBSD 11.0-CURRENT #31 r290224: Sat Oct 31 02:56:36 PDT 2015 dl@zipper.catspoiler.org:/usr/obj/usr/src/sys/GENERIC amd64 >> >> >> The trigger was: >> >> swap_pager: indefinite wait buffer: bufobj: 0 blkno: 1737153, >> size 4096 >> >> That triggered an attempt to kill a process and create a core file for >> it. >> >> Where things get strange is at #16 and #17 in the backtrace: >> >> #13 0xffffffff80a20676 in kassert_panic ( >> fmt=0xffffffff8137bf84 "%s called with a NULL sbuf pointer") >> at /usr/src/sys/kern/kern_shutdown.c:647 >> #14 0xffffffff80a66c51 in _assert_sbuf_integrity ( >> fun=0xffffffff8137bdf8 "sbuf_vprintf", s=0x0) >> at /usr/src/sys/kern/subr_sbuf.c:103 >> #15 0xffffffff80a6742e in sbuf_vprintf (s=0x0, fmt=, >> ap=) at /usr/src/sys/kern/subr_sbuf.c:583 >> #16 sbuf_printf (s=0x0, fmt=0xffffffff80fd8ad3 "%s") >> at /usr/src/sys/kern/subr_sbuf.c:677 >> #17 0xffffffff80a237d6 in corefile_open (compress=0, comm=, >> uid=, pid=, td=, >> vpp=, namep=) >> at /usr/src/sys/kern/kern_sig.c:3188 >> #18 coredump (td=0x0) at /usr/src/sys/kern/kern_sig.c:3337 >> #19 sigexit (td=0x0, sig=40) at /usr/src/sys/kern/kern_sig.c:2976 >> #20 0xffffffff80a6fcb8 in userret (td=0xfffff8057b0a59a0, >> frame=) at /usr/src/sys/kern/subr_trap.c:147 >> #21 0xffffffff80e68619 in syscallret (td=0xfffff8057b0a59a0, >> error=, sa=) >> at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:187 >> #22 amd64_syscall (td=0xfffff8057b0a59a0, traced=0) >> at /usr/src/sys/amd64/amd64/trap.c:956 >> #23 0xffffffff80e47a3b in Xfast_syscall () >> at /usr/src/sys/amd64/amd64/exception.S:394 >> #24 0x000000080cc9f12a in ?? () >> >> In #16, s is 0x0, which is strange because kern_sig.c:3188 is: >> >> #17 0xffffffff80a237d6 in corefile_open (compress=0, comm=, >> uid=, pid=, td=, >> vpp=, namep=) >> at /usr/src/sys/kern/kern_sig.c:3188 >> 3188 sbuf_printf(&sb, "%s", comm); >> >> I don't see how &sb can be NULL ... > > Agreed on that. > >> Also interesting is: >> >> (kgdb) print sb >> $1 = >> >> I don't see how sb can be optimized out since its address is passed to >> sbuf_printf(). > > This is just what gdb says when it doesn't think it can find the value > becuase the regsiter holding it (according to the debug info) has been > overwritten. Sometimes if you look in the disassembly you can find another > copy of the value in another register and figure out the value thay way. > If you use kgdb710 from the devel/gdb port it sometimes does a better job > here, but even with it I often run into this. I am using kgdb710. >> While format is optimized out, the corefilename isn't and >> format=corefilename >> >> (kgdb) print format >> $2 = >> (kgdb) print corefilename >> $3 = "%N.core", '\000' >> >> Since this is the first iteration of the loop, and the crash happens on >> the entry into sbuf_printf()->sbuf_vprintf(), it doesn't look like it >> could be caused by stack smash. >> >> Under other circumstances I've seen other processes dump core, and I can >> trigger core dumps by sending SIGQUIT to processes ... >> >> %sleep 10 & >> [1] 2456 >> %kill -QUIT %1 >> % >> [1] Quit sleep 10 (core dumped) >> >> Another other oddity is td=0x0 in frames 18 and 19. I'm guessing that >> kgdb is wrong about this because that should cause a NULL pointer >> dereference panic in sigexit(). > > Probably the saved register on the stack that gdb thinks td is saved in has > been overwritten (so this could be a buffer overrun onto the stack after > all). > >> Also sig=40 in sigexit() is strange since the highest numbered signal is >> 33. >> >> Lastly, I don't see how we get from userret() to sigexit(). Line 147 in >> subr_trap.c (userret()) is: >> WITNESS_WARN(WARN_PANIC, NULL, "userret: returning"); > > I suspect there is some tail-call optimization or some such so that the function > userret() called branched go sigexit() instead of calling it (since it is marked > __dead2, meaning it never returns). kgdb710 might cope with this better (it > constructs "fake" frames for inlined functions for example). > > If you want to look at this further, try going to frame 16 and dissassembling the > instructions before the call to see if you can spot which register the first > parameter (saved in %rdi IIRC) comes from. Dump of assembler code for function sbuf_printf: 0xffffffff80a673e0 <+0>: push %rbp 0xffffffff80a673e1 <+1>: mov %rsp,%rbp 0xffffffff80a673e4 <+4>: push %r14 0xffffffff80a673e6 <+6>: push %rbx 0xffffffff80a673e7 <+7>: sub $0x50,%rsp 0xffffffff80a673eb <+11>: mov %rsi,%r14 0xffffffff80a673ee <+14>: mov %rdi,%rbx 0xffffffff80a673f1 <+17>: mov %r9,-0x38(%rbp) 0xffffffff80a673f5 <+21>: mov %r8,-0x40(%rbp) 0xffffffff80a673f9 <+25>: mov %rcx,-0x48(%rbp) 0xffffffff80a673fd <+29>: mov %rdx,-0x50(%rbp) 0xffffffff80a67401 <+33>: lea -0x60(%rbp),%rax 0xffffffff80a67405 <+37>: mov %rax,-0x20(%rbp) 0xffffffff80a67409 <+41>: lea 0x10(%rbp),%rax 0xffffffff80a6740d <+45>: mov %rax,-0x28(%rbp) 0xffffffff80a67411 <+49>: movl $0x30,-0x2c(%rbp) 0xffffffff80a67418 <+56>: movl $0x10,-0x30(%rbp) 0xffffffff80a6741f <+63>: mov $0xffffffff8137bdf8,%rdi 0xffffffff80a67426 <+70>: mov %rbx,%rsi 0xffffffff80a67429 <+73>: callq 0xffffffff80a66c00 <_assert_sbuf_integrity> 0xffffffff80a237b9 <+825>: jmpq 0xffffffff80a236fd 0xffffffff80a237be <+830>: mov $0xffffffff80fd8ad3,%rsi 0xffffffff80a237c5 <+837>: xor %eax,%eax 0xffffffff80a237c7 <+839>: mov %r12,%rdi 0xffffffff80a237ca <+842>: mov -0x228(%rbp),%rdx 0xffffffff80a237d1 <+849>: callq 0xffffffff80a673e0 => 0xffffffff80a237d6 <+854>: inc %r14d 0xffffffff80a237d9 <+857>: jmpq 0xffffffff80a236fd