From owner-freebsd-current@freebsd.org  Thu Jan 21 01:15:36 2016
Return-Path: <owner-freebsd-current@freebsd.org>
Delivered-To: freebsd-current@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 5E82AA89E63
 for <freebsd-current@mailman.ysv.freebsd.org>;
 Thu, 21 Jan 2016 01:15:36 +0000 (UTC) (envelope-from jhb@freebsd.org)
Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [IPv6:2001:470:1f11:75::1])
 (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id 1EA641F2A;
 Thu, 21 Jan 2016 01:15:36 +0000 (UTC) (envelope-from jhb@freebsd.org)
Received: from ralph.baldwin.cx (c-73-231-226-104.hsd1.ca.comcast.net
 [73.231.226.104])
 by bigwig.baldwin.cx (Postfix) with ESMTPSA id B77DEB94B;
 Wed, 20 Jan 2016 20:15:34 -0500 (EST)
From: John Baldwin <jhb@freebsd.org>
To: Don Lewis <truckman@freebsd.org>
Cc: freebsd-current@freebsd.org
Subject: Re: panic: sbuf_vprintf called with a NULL sbuf pointer
Date: Wed, 20 Jan 2016 17:14:56 -0800
Message-ID: <2064398.31Gqv7eiJr@ralph.baldwin.cx>
User-Agent: KMail/4.14.3 (FreeBSD/10.2-STABLE; KDE/4.14.3; amd64; ; )
In-Reply-To: <201601202315.u0KNF687077387@gw.catspoiler.org>
References: <201601202315.u0KNF687077387@gw.catspoiler.org>
MIME-Version: 1.0
Content-Transfer-Encoding: 7Bit
Content-Type: text/plain; charset="us-ascii"
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7
 (bigwig.baldwin.cx); Wed, 20 Jan 2016 20:15:34 -0500 (EST)
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
 <freebsd-current.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-current>, 
 <mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current/>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-current>, 
 <mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 21 Jan 2016 01:15:36 -0000

On Wednesday, January 20, 2016 03:15:06 PM Don Lewis wrote:
> On  8 Dec, John Baldwin wrote:
> > On Tuesday, December 08, 2015 11:39:56 AM Don Lewis wrote:
> >> On  8 Dec, John Baldwin wrote:
> >> > On Monday, December 07, 2015 10:10:51 PM Don Lewis wrote:
> >> >> On  2 Dec, John Baldwin wrote:
> >> >> > On Wednesday, December 02, 2015 01:25:56 PM Don Lewis wrote:
> >> >> >> > If you want to look at this further, try going to frame 16 and dissassembling the
> >> >> >> > instructions before the call to see if you can spot which register the first
> >> >> >> > parameter (saved in %rdi IIRC) comes from.
> >> >> >> 
> >> >> >> Dump of assembler code for function sbuf_printf:
> >> >> >>    0xffffffff80a673e0 <+0>:	push   %rbp
> >> >> >>    0xffffffff80a673e1 <+1>:	mov    %rsp,%rbp
> >> >> >>    0xffffffff80a673e4 <+4>:	push   %r14
> >> >> >>    0xffffffff80a673e6 <+6>:	push   %rbx
> >> >> >>    0xffffffff80a673e7 <+7>:	sub    $0x50,%rsp
> >> >> >>    0xffffffff80a673eb <+11>:	mov    %rsi,%r14
> >> >> >>    0xffffffff80a673ee <+14>:	mov    %rdi,%rbx
> >> >> >>    0xffffffff80a673f1 <+17>:	mov    %r9,-0x38(%rbp)
> >> >> >>    0xffffffff80a673f5 <+21>:	mov    %r8,-0x40(%rbp)
> >> >> >>    0xffffffff80a673f9 <+25>:	mov    %rcx,-0x48(%rbp)
> >> >> >>    0xffffffff80a673fd <+29>:	mov    %rdx,-0x50(%rbp)
> >> >> >>    0xffffffff80a67401 <+33>:	lea    -0x60(%rbp),%rax
> >> >> >>    0xffffffff80a67405 <+37>:	mov    %rax,-0x20(%rbp)
> >> >> >>    0xffffffff80a67409 <+41>:	lea    0x10(%rbp),%rax
> >> >> >>    0xffffffff80a6740d <+45>:	mov    %rax,-0x28(%rbp)
> >> >> >>    0xffffffff80a67411 <+49>:	movl   $0x30,-0x2c(%rbp)
> >> >> >>    0xffffffff80a67418 <+56>:	movl   $0x10,-0x30(%rbp)
> >> >> >>    0xffffffff80a6741f <+63>:	mov    $0xffffffff8137bdf8,%rdi
> >> >> >>    0xffffffff80a67426 <+70>:	mov    %rbx,%rsi
> >> >> >>    0xffffffff80a67429 <+73>:	callq  0xffffffff80a66c00 <_assert_sbuf_integrity>
> >> >> >> 
> >> >> >> 
> >> >> >>    0xffffffff80a237b9 <+825>:	jmpq   0xffffffff80a236fd <sigexit+637>
> >> >> >>    0xffffffff80a237be <+830>:	mov    $0xffffffff80fd8ad3,%rsi
> >> >> >>    0xffffffff80a237c5 <+837>:	xor    %eax,%eax
> >> >> >>    0xffffffff80a237c7 <+839>:	mov    %r12,%rdi
> >> >> >>    0xffffffff80a237ca <+842>:	mov    -0x228(%rbp),%rdx
> >> >> >>    0xffffffff80a237d1 <+849>:	callq  0xffffffff80a673e0 <sbuf_printf>
> >> >> >> => 0xffffffff80a237d6 <+854>:	inc    %r14d
> >> >> >>    0xffffffff80a237d9 <+857>:	jmpq   0xffffffff80a236fd <sigexit+637>
> >> >> > 
> >> >> > So maybe try 'p $r12' in the corefile_open() frame.
> >> >> 
> >> >> #17 0xffffffff80a237d6 in corefile_open (compress=0, comm=<optimized out>, 
> >> >>     uid=<optimized out>, pid=<optimized out>, td=<optimized out>, 
> >> >>     vpp=<optimized out>, namep=<optimized out>)
> >> >>     at /usr/src/sys/kern/kern_sig.c:3188
> >> >> 3188					sbuf_printf(&sb, "%s", comm);
> >> >> (kgdb) p $r12
> >> >> $1 = 0
> >> > 
> >> > So it's definitely zero. :(  The next step is probably to disassemble the
> >> > corefile_open function (ugh) and walk backwards to find where %r12 is read
> >> > from.  It might be from a local variable on the stack, so then you would
> >> > want to examine that memory in the stack and the surrounding memory to see
> >> > if there is memory corruption and perhaps if there is anything recognizable
> >> > about it (e.g. if the corruption contains some sort of data you recognize,
> >> > or if the corruption is bounded by a certain length, etc.).  It's a bit of
> >> > a shot in the dark though.
> >> > 
> >> > Is this reproducible?
> >> 
> >> No it's not.  The only time it happened was when there was a swap
> >> timeout, probably because of a lengthy deep recovery on one of the
> >> mirrored swap devices.
> >> 
> >> The code in question is:
> >>         struct sbuf sb;
> >> [snip]
> >>         (void)sbuf_new(&sb, name, MAXPATHLEN, SBUF_FIXEDLEN);
> >> [snip]
> >>         for (i = 0; format[i] != '\0'; i++) {
> >>                 switch (format[i]) {
> >>                 case '%':       /* Format character */
> >>                         i++;
> >>                         switch (format[i]) {
> >> [snip]
> >>                         case 'N':       /* process name */
> >>                                 sbuf_printf(&sb, "%s", comm);
> >>                                 break;
> >> 
> >> 
> >> &sb is used in a bunch of places, so the compiler probably computes its
> >> value once by adding the proper offset to the stack pointer and stashing
> >> the result in a register.  Since kern.corefile is "%N.core", the failure
> >> is happening on the first interation of the loop, so there isn't much
> >> opportunity for things to get corrupted.  Also, the control flow in this
> >> function only depends on the format, so there shouldn't be anything
> >> special about a swap timeout vs. a segfault generated core.
> > 
> > Yes, r12 is call-safe (IIRC), so I expect it only computes it once as well,
> > but I've sometimes seen the compiler spill local vars onto the stack due to
> > register pressure.  That said, I think it is unlikely it would have to spill
> > &sb during the early part of the function. :(
> 
> I had some more time to look at this.  It's a bit messy because
> corefile_open() and coredump() both get inlined into sig_exit().  Here's
> the section of code from the malloc() call and sbuf_new() to the failing
> sbuf_printf() call:
> 
>    0xffffffff80a23685 <+517>:   callq  0xffffffff80a00680 <malloc>
>    0xffffffff80a2368a <+522>:   mov    %rax,-0x238(%rbp)
>    0xffffffff80a23691 <+529>:   lea    -0x138(%rbp),%r12
>   ....
>    0xffffffff80a237b9 <+825>:   jmpq   0xffffffff80a236fd <sigexit+637>
>    0xffffffff80a237be <+830>:   mov    $0xffffffff80fd8ad3,%rsi
>    0xffffffff80a237c5 <+837>:   xor    %eax,%eax
>    0xffffffff80a237c7 <+839>:   mov    %r12,%rdi
>    0xffffffff80a237ca <+842>:   mov    -0x228(%rbp),%rdx
>    0xffffffff80a237d1 <+849>:   callq  0xffffffff80a673e0 <sbuf_printf>
> => 0xffffffff80a237d6 <+854>:   inc    %r14d
>    0xffffffff80a237d9 <+857>:   jmpq   0xffffffff80a236fd <sigexit+637>
>    0xffffffff80a237de <+862>:   mov    $0xffffffff81361469,%rsi
> 
> Basically it's:
> 	lea    -0x138(%rbp),%r12
> 	[ snip ]
> 	mov     %r12,%rdi
> 	callq	sbuf_printf

Yeah, at this point I can only think of bizarre-o things like a trap
clobbering %r12 in the saved trapframe or the like.  I can't imagine why
that would happen though.  If we had a bug that clobbered saved registers
in the trapframe you would expect it to happen far more often.  OTOH, if
some code had a stale pointer that was later reused by a kthread's stack
and the thread was interrupted (e.g. by a device interrupt) while in
this routine, _and_ the stale pointer was indirected and trashed the
saved %r12, then that would explain this.  I just feel like I have better
odds of seeing a unicorn in my driveway tonight. :-/

-- 
John Baldwin