From owner-freebsd-current@freebsd.org  Wed Dec  2 11:28:32 2015
Return-Path: <owner-freebsd-current@freebsd.org>
Delivered-To: freebsd-current@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id D5637A3E62C
 for <freebsd-current@mailman.ysv.freebsd.org>;
 Wed,  2 Dec 2015 11:28:32 +0000 (UTC) (envelope-from jhb@freebsd.org)
Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [IPv6:2001:470:1f11:75::1])
 (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id 935D2131C;
 Wed,  2 Dec 2015 11:28:32 +0000 (UTC) (envelope-from jhb@freebsd.org)
Received: from ralph.baldwin.cx (c-73-231-226-104.hsd1.ca.comcast.net
 [73.231.226.104])
 by bigwig.baldwin.cx (Postfix) with ESMTPSA id 91230B93C;
 Wed,  2 Dec 2015 06:28:31 -0500 (EST)
From: John Baldwin <jhb@freebsd.org>
To: freebsd-current@freebsd.org
Cc: Don Lewis <truckman@freebsd.org>
Subject: Re: panic: sbuf_vprintf called with a NULL sbuf pointer
Date: Wed, 02 Dec 2015 03:28:07 -0800
Message-ID: <2392832.V4SzgugFk4@ralph.baldwin.cx>
User-Agent: KMail/4.14.3 (FreeBSD/10.2-STABLE; KDE/4.14.3; amd64; ; )
In-Reply-To: <201511300016.tAU0GB2c093099@gw.catspoiler.org>
References: <201511300016.tAU0GB2c093099@gw.catspoiler.org>
MIME-Version: 1.0
Content-Transfer-Encoding: 7Bit
Content-Type: text/plain; charset="us-ascii"
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7
 (bigwig.baldwin.cx); Wed, 02 Dec 2015 06:28:31 -0500 (EST)
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
 <freebsd-current.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-current>, 
 <mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current/>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-current>, 
 <mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 02 Dec 2015 11:28:32 -0000

On Sunday, November 29, 2015 04:16:11 PM Don Lewis wrote:
> I got this strange panic the other day:
> 
> 	panic: sbuf_vprintf called with a NULL sbuf pointer
> 
> The machine was running this:
> 
>   FreeBSD zipper.catspoiler.org 11.0-CURRENT FreeBSD 11.0-CURRENT #31 r290224: Sat Oct 31 02:56:36 PDT 2015     dl@zipper.catspoiler.org:/usr/obj/usr/src/sys/GENERIC  amd64
> 
> 
> The trigger was:
> 
> 	swap_pager: indefinite wait buffer: bufobj: 0 blkno: 1737153,
>         size 4096
> 
> That triggered an attempt to kill a process and create a core file for
> it.
> 
> Where things get strange is at #16 and #17 in the backtrace:
> 
> #13 0xffffffff80a20676 in kassert_panic (
>     fmt=0xffffffff8137bf84 "%s called with a NULL sbuf pointer")
>     at /usr/src/sys/kern/kern_shutdown.c:647
> #14 0xffffffff80a66c51 in _assert_sbuf_integrity (
>     fun=0xffffffff8137bdf8 "sbuf_vprintf", s=0x0)
>     at /usr/src/sys/kern/subr_sbuf.c:103
> #15 0xffffffff80a6742e in sbuf_vprintf (s=0x0, fmt=<optimized out>, 
>     ap=<optimized out>) at /usr/src/sys/kern/subr_sbuf.c:583
> #16 sbuf_printf (s=0x0, fmt=0xffffffff80fd8ad3 "%s")
>     at /usr/src/sys/kern/subr_sbuf.c:677
> #17 0xffffffff80a237d6 in corefile_open (compress=0, comm=<optimized out>, 
>     uid=<optimized out>, pid=<optimized out>, td=<optimized out>, 
>     vpp=<optimized out>, namep=<optimized out>)
>     at /usr/src/sys/kern/kern_sig.c:3188
> #18 coredump (td=0x0) at /usr/src/sys/kern/kern_sig.c:3337
> #19 sigexit (td=0x0, sig=40) at /usr/src/sys/kern/kern_sig.c:2976
> #20 0xffffffff80a6fcb8 in userret (td=0xfffff8057b0a59a0, 
>     frame=<optimized out>) at /usr/src/sys/kern/subr_trap.c:147
> #21 0xffffffff80e68619 in syscallret (td=0xfffff8057b0a59a0, 
>     error=<optimized out>, sa=<optimized out>)
>     at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:187
> #22 amd64_syscall (td=0xfffff8057b0a59a0, traced=0)
>     at /usr/src/sys/amd64/amd64/trap.c:956
> #23 0xffffffff80e47a3b in Xfast_syscall ()
>     at /usr/src/sys/amd64/amd64/exception.S:394
> #24 0x000000080cc9f12a in ?? ()
> 
> In #16, s is 0x0, which is strange because kern_sig.c:3188 is:
> 
> #17 0xffffffff80a237d6 in corefile_open (compress=0, comm=<optimized out>, 
>     uid=<optimized out>, pid=<optimized out>, td=<optimized out>, 
>     vpp=<optimized out>, namep=<optimized out>)
>     at /usr/src/sys/kern/kern_sig.c:3188
> 3188					sbuf_printf(&sb, "%s", comm);
> 
> I don't see how &sb can be NULL ...

Agreed on that.

> Also interesting is:
> 
> (kgdb) print sb
> $1 = <optimized out>
> 
> I don't see how sb can be optimized out since its address is passed to
> sbuf_printf().

This is just what gdb says when it doesn't think it can find the value
becuase the regsiter holding it (according to the debug info) has been
overwritten.  Sometimes if you look in the disassembly you can find another
copy of the value in another register and figure out the value thay way.
If you use kgdb710 from the devel/gdb port it sometimes does a better job
here, but even with it I often run into this.
 
> While format is optimized out, the corefilename isn't and
> format=corefilename
> 
> (kgdb) print format
> $2 = <optimized out>
> (kgdb) print corefilename
> $3 = "%N.core", '\000' <repeats 1016 times>
> 
> Since this is the first iteration of the loop, and the crash happens on
> the entry into sbuf_printf()->sbuf_vprintf(), it doesn't look like it
> could be caused by stack smash.
> 
> Under other circumstances I've seen other processes dump core, and I can
> trigger core dumps by sending SIGQUIT to processes ...
> 
> %sleep 10 &
> [1] 2456
> %kill -QUIT %1
> %
> [1]    Quit                          sleep 10 (core dumped)
> 
> Another other oddity is td=0x0 in frames 18 and 19.  I'm guessing that
> kgdb is wrong about this because that should cause a NULL pointer
> dereference panic in sigexit().

Probably the saved register on the stack that gdb thinks td is saved in has
been overwritten (so this could be a buffer overrun onto the stack after
all).

> Also sig=40 in sigexit() is strange since the highest numbered signal is
> 33.
> 
> Lastly, I don't see how we get from userret() to sigexit().  Line 147 in
> subr_trap.c (userret()) is:
>          WITNESS_WARN(WARN_PANIC, NULL, "userret: returning");

I suspect there is some tail-call optimization or some such so that the function
userret() called branched go sigexit() instead of calling it (since it is marked
__dead2, meaning it never returns).  kgdb710 might cope with this better (it
constructs "fake" frames for inlined functions for example).

If you want to look at this further, try going to frame 16 and dissassembling the
instructions before the call to see if you can spot which register the first
parameter (saved in %rdi IIRC) comes from.

-- 
John Baldwin