From owner-freebsd-current@freebsd.org Thu Jan 21 01:15:36 2016 Return-Path: Delivered-To: freebsd-current@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 5E82AA89E63 for ; Thu, 21 Jan 2016 01:15:36 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [IPv6:2001:470:1f11:75::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 1EA641F2A; Thu, 21 Jan 2016 01:15:36 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from ralph.baldwin.cx (c-73-231-226-104.hsd1.ca.comcast.net [73.231.226.104]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id B77DEB94B; Wed, 20 Jan 2016 20:15:34 -0500 (EST) From: John Baldwin To: Don Lewis Cc: freebsd-current@freebsd.org Subject: Re: panic: sbuf_vprintf called with a NULL sbuf pointer Date: Wed, 20 Jan 2016 17:14:56 -0800 Message-ID: <2064398.31Gqv7eiJr@ralph.baldwin.cx> User-Agent: KMail/4.14.3 (FreeBSD/10.2-STABLE; KDE/4.14.3; amd64; ; ) In-Reply-To: <201601202315.u0KNF687077387@gw.catspoiler.org> References: <201601202315.u0KNF687077387@gw.catspoiler.org> MIME-Version: 1.0 Content-Transfer-Encoding: 7Bit Content-Type: text/plain; charset="us-ascii" X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Wed, 20 Jan 2016 20:15:34 -0500 (EST) X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 21 Jan 2016 01:15:36 -0000 On Wednesday, January 20, 2016 03:15:06 PM Don Lewis wrote: > On 8 Dec, John Baldwin wrote: > > On Tuesday, December 08, 2015 11:39:56 AM Don Lewis wrote: > >> On 8 Dec, John Baldwin wrote: > >> > On Monday, December 07, 2015 10:10:51 PM Don Lewis wrote: > >> >> On 2 Dec, John Baldwin wrote: > >> >> > On Wednesday, December 02, 2015 01:25:56 PM Don Lewis wrote: > >> >> >> > If you want to look at this further, try going to frame 16 and dissassembling the > >> >> >> > instructions before the call to see if you can spot which register the first > >> >> >> > parameter (saved in %rdi IIRC) comes from. > >> >> >> > >> >> >> Dump of assembler code for function sbuf_printf: > >> >> >> 0xffffffff80a673e0 <+0>: push %rbp > >> >> >> 0xffffffff80a673e1 <+1>: mov %rsp,%rbp > >> >> >> 0xffffffff80a673e4 <+4>: push %r14 > >> >> >> 0xffffffff80a673e6 <+6>: push %rbx > >> >> >> 0xffffffff80a673e7 <+7>: sub $0x50,%rsp > >> >> >> 0xffffffff80a673eb <+11>: mov %rsi,%r14 > >> >> >> 0xffffffff80a673ee <+14>: mov %rdi,%rbx > >> >> >> 0xffffffff80a673f1 <+17>: mov %r9,-0x38(%rbp) > >> >> >> 0xffffffff80a673f5 <+21>: mov %r8,-0x40(%rbp) > >> >> >> 0xffffffff80a673f9 <+25>: mov %rcx,-0x48(%rbp) > >> >> >> 0xffffffff80a673fd <+29>: mov %rdx,-0x50(%rbp) > >> >> >> 0xffffffff80a67401 <+33>: lea -0x60(%rbp),%rax > >> >> >> 0xffffffff80a67405 <+37>: mov %rax,-0x20(%rbp) > >> >> >> 0xffffffff80a67409 <+41>: lea 0x10(%rbp),%rax > >> >> >> 0xffffffff80a6740d <+45>: mov %rax,-0x28(%rbp) > >> >> >> 0xffffffff80a67411 <+49>: movl $0x30,-0x2c(%rbp) > >> >> >> 0xffffffff80a67418 <+56>: movl $0x10,-0x30(%rbp) > >> >> >> 0xffffffff80a6741f <+63>: mov $0xffffffff8137bdf8,%rdi > >> >> >> 0xffffffff80a67426 <+70>: mov %rbx,%rsi > >> >> >> 0xffffffff80a67429 <+73>: callq 0xffffffff80a66c00 <_assert_sbuf_integrity> > >> >> >> > >> >> >> > >> >> >> 0xffffffff80a237b9 <+825>: jmpq 0xffffffff80a236fd > >> >> >> 0xffffffff80a237be <+830>: mov $0xffffffff80fd8ad3,%rsi > >> >> >> 0xffffffff80a237c5 <+837>: xor %eax,%eax > >> >> >> 0xffffffff80a237c7 <+839>: mov %r12,%rdi > >> >> >> 0xffffffff80a237ca <+842>: mov -0x228(%rbp),%rdx > >> >> >> 0xffffffff80a237d1 <+849>: callq 0xffffffff80a673e0 > >> >> >> => 0xffffffff80a237d6 <+854>: inc %r14d > >> >> >> 0xffffffff80a237d9 <+857>: jmpq 0xffffffff80a236fd > >> >> > > >> >> > So maybe try 'p $r12' in the corefile_open() frame. > >> >> > >> >> #17 0xffffffff80a237d6 in corefile_open (compress=0, comm=, > >> >> uid=, pid=, td=, > >> >> vpp=, namep=) > >> >> at /usr/src/sys/kern/kern_sig.c:3188 > >> >> 3188 sbuf_printf(&sb, "%s", comm); > >> >> (kgdb) p $r12 > >> >> $1 = 0 > >> > > >> > So it's definitely zero. :( The next step is probably to disassemble the > >> > corefile_open function (ugh) and walk backwards to find where %r12 is read > >> > from. It might be from a local variable on the stack, so then you would > >> > want to examine that memory in the stack and the surrounding memory to see > >> > if there is memory corruption and perhaps if there is anything recognizable > >> > about it (e.g. if the corruption contains some sort of data you recognize, > >> > or if the corruption is bounded by a certain length, etc.). It's a bit of > >> > a shot in the dark though. > >> > > >> > Is this reproducible? > >> > >> No it's not. The only time it happened was when there was a swap > >> timeout, probably because of a lengthy deep recovery on one of the > >> mirrored swap devices. > >> > >> The code in question is: > >> struct sbuf sb; > >> [snip] > >> (void)sbuf_new(&sb, name, MAXPATHLEN, SBUF_FIXEDLEN); > >> [snip] > >> for (i = 0; format[i] != '\0'; i++) { > >> switch (format[i]) { > >> case '%': /* Format character */ > >> i++; > >> switch (format[i]) { > >> [snip] > >> case 'N': /* process name */ > >> sbuf_printf(&sb, "%s", comm); > >> break; > >> > >> > >> &sb is used in a bunch of places, so the compiler probably computes its > >> value once by adding the proper offset to the stack pointer and stashing > >> the result in a register. Since kern.corefile is "%N.core", the failure > >> is happening on the first interation of the loop, so there isn't much > >> opportunity for things to get corrupted. Also, the control flow in this > >> function only depends on the format, so there shouldn't be anything > >> special about a swap timeout vs. a segfault generated core. > > > > Yes, r12 is call-safe (IIRC), so I expect it only computes it once as well, > > but I've sometimes seen the compiler spill local vars onto the stack due to > > register pressure. That said, I think it is unlikely it would have to spill > > &sb during the early part of the function. :( > > I had some more time to look at this. It's a bit messy because > corefile_open() and coredump() both get inlined into sig_exit(). Here's > the section of code from the malloc() call and sbuf_new() to the failing > sbuf_printf() call: > > 0xffffffff80a23685 <+517>: callq 0xffffffff80a00680 > 0xffffffff80a2368a <+522>: mov %rax,-0x238(%rbp) > 0xffffffff80a23691 <+529>: lea -0x138(%rbp),%r12 > .... > 0xffffffff80a237b9 <+825>: jmpq 0xffffffff80a236fd > 0xffffffff80a237be <+830>: mov $0xffffffff80fd8ad3,%rsi > 0xffffffff80a237c5 <+837>: xor %eax,%eax > 0xffffffff80a237c7 <+839>: mov %r12,%rdi > 0xffffffff80a237ca <+842>: mov -0x228(%rbp),%rdx > 0xffffffff80a237d1 <+849>: callq 0xffffffff80a673e0 > => 0xffffffff80a237d6 <+854>: inc %r14d > 0xffffffff80a237d9 <+857>: jmpq 0xffffffff80a236fd > 0xffffffff80a237de <+862>: mov $0xffffffff81361469,%rsi > > Basically it's: > lea -0x138(%rbp),%r12 > [ snip ] > mov %r12,%rdi > callq sbuf_printf Yeah, at this point I can only think of bizarre-o things like a trap clobbering %r12 in the saved trapframe or the like. I can't imagine why that would happen though. If we had a bug that clobbered saved registers in the trapframe you would expect it to happen far more often. OTOH, if some code had a stale pointer that was later reused by a kthread's stack and the thread was interrupted (e.g. by a device interrupt) while in this routine, _and_ the stale pointer was indirected and trashed the saved %r12, then that would explain this. I just feel like I have better odds of seeing a unicorn in my driveway tonight. :-/ -- John Baldwin