From owner-freebsd-current@freebsd.org Tue Dec 8 19:40:05 2015 Return-Path: Delivered-To: freebsd-current@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 0BE619D5CF6 for ; Tue, 8 Dec 2015 19:40:05 +0000 (UTC) (envelope-from truckman@FreeBSD.org) Received: from gw.catspoiler.org (unknown [IPv6:2602:304:b010:ef20::f2]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "gw.catspoiler.org", Issuer "gw.catspoiler.org" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id DC5141DC9; Tue, 8 Dec 2015 19:40:04 +0000 (UTC) (envelope-from truckman@FreeBSD.org) Received: from FreeBSD.org (mousie.catspoiler.org [192.168.101.2]) by gw.catspoiler.org (8.15.2/8.15.2) with ESMTP id tB8JduPg003988; Tue, 8 Dec 2015 11:40:00 -0800 (PST) (envelope-from truckman@FreeBSD.org) Message-Id: <201512081940.tB8JduPg003988@gw.catspoiler.org> Date: Tue, 8 Dec 2015 11:39:56 -0800 (PST) From: Don Lewis Subject: Re: panic: sbuf_vprintf called with a NULL sbuf pointer To: jhb@freebsd.org cc: freebsd-current@freebsd.org In-Reply-To: <2133429.08nAJynLee@ralph.baldwin.cx> MIME-Version: 1.0 Content-Type: TEXT/plain; charset=us-ascii X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 08 Dec 2015 19:40:05 -0000 On 8 Dec, John Baldwin wrote: > On Monday, December 07, 2015 10:10:51 PM Don Lewis wrote: >> On 2 Dec, John Baldwin wrote: >> > On Wednesday, December 02, 2015 01:25:56 PM Don Lewis wrote: >> >> > If you want to look at this further, try going to frame 16 and dissassembling the >> >> > instructions before the call to see if you can spot which register the first >> >> > parameter (saved in %rdi IIRC) comes from. >> >> >> >> Dump of assembler code for function sbuf_printf: >> >> 0xffffffff80a673e0 <+0>: push %rbp >> >> 0xffffffff80a673e1 <+1>: mov %rsp,%rbp >> >> 0xffffffff80a673e4 <+4>: push %r14 >> >> 0xffffffff80a673e6 <+6>: push %rbx >> >> 0xffffffff80a673e7 <+7>: sub $0x50,%rsp >> >> 0xffffffff80a673eb <+11>: mov %rsi,%r14 >> >> 0xffffffff80a673ee <+14>: mov %rdi,%rbx >> >> 0xffffffff80a673f1 <+17>: mov %r9,-0x38(%rbp) >> >> 0xffffffff80a673f5 <+21>: mov %r8,-0x40(%rbp) >> >> 0xffffffff80a673f9 <+25>: mov %rcx,-0x48(%rbp) >> >> 0xffffffff80a673fd <+29>: mov %rdx,-0x50(%rbp) >> >> 0xffffffff80a67401 <+33>: lea -0x60(%rbp),%rax >> >> 0xffffffff80a67405 <+37>: mov %rax,-0x20(%rbp) >> >> 0xffffffff80a67409 <+41>: lea 0x10(%rbp),%rax >> >> 0xffffffff80a6740d <+45>: mov %rax,-0x28(%rbp) >> >> 0xffffffff80a67411 <+49>: movl $0x30,-0x2c(%rbp) >> >> 0xffffffff80a67418 <+56>: movl $0x10,-0x30(%rbp) >> >> 0xffffffff80a6741f <+63>: mov $0xffffffff8137bdf8,%rdi >> >> 0xffffffff80a67426 <+70>: mov %rbx,%rsi >> >> 0xffffffff80a67429 <+73>: callq 0xffffffff80a66c00 <_assert_sbuf_integrity> >> >> >> >> >> >> 0xffffffff80a237b9 <+825>: jmpq 0xffffffff80a236fd >> >> 0xffffffff80a237be <+830>: mov $0xffffffff80fd8ad3,%rsi >> >> 0xffffffff80a237c5 <+837>: xor %eax,%eax >> >> 0xffffffff80a237c7 <+839>: mov %r12,%rdi >> >> 0xffffffff80a237ca <+842>: mov -0x228(%rbp),%rdx >> >> 0xffffffff80a237d1 <+849>: callq 0xffffffff80a673e0 >> >> => 0xffffffff80a237d6 <+854>: inc %r14d >> >> 0xffffffff80a237d9 <+857>: jmpq 0xffffffff80a236fd >> > >> > So maybe try 'p $r12' in the corefile_open() frame. >> >> #17 0xffffffff80a237d6 in corefile_open (compress=0, comm=, >> uid=, pid=, td=, >> vpp=, namep=) >> at /usr/src/sys/kern/kern_sig.c:3188 >> 3188 sbuf_printf(&sb, "%s", comm); >> (kgdb) p $r12 >> $1 = 0 > > So it's definitely zero. :( The next step is probably to disassemble the > corefile_open function (ugh) and walk backwards to find where %r12 is read > from. It might be from a local variable on the stack, so then you would > want to examine that memory in the stack and the surrounding memory to see > if there is memory corruption and perhaps if there is anything recognizable > about it (e.g. if the corruption contains some sort of data you recognize, > or if the corruption is bounded by a certain length, etc.). It's a bit of > a shot in the dark though. > > Is this reproducible? No it's not. The only time it happened was when there was a swap timeout, probably because of a lengthy deep recovery on one of the mirrored swap devices. The code in question is: struct sbuf sb; [snip] (void)sbuf_new(&sb, name, MAXPATHLEN, SBUF_FIXEDLEN); [snip] for (i = 0; format[i] != '\0'; i++) { switch (format[i]) { case '%': /* Format character */ i++; switch (format[i]) { [snip] case 'N': /* process name */ sbuf_printf(&sb, "%s", comm); break; &sb is used in a bunch of places, so the compiler probably computes its value once by adding the proper offset to the stack pointer and stashing the result in a register. Since kern.corefile is "%N.core", the failure is happening on the first interation of the loop, so there isn't much opportunity for things to get corrupted. Also, the control flow in this function only depends on the format, so there shouldn't be anything special about a swap timeout vs. a segfault generated core. How is gdb able to print the register contents for an arbitrary stack frame? It's not like this is a SPARC with register windows. Aren't only the final register values when the core dump was generated saved along with the memory contents? Unless a register contents are pushed onto the stack, I would think that if a callee overwrites a register, its contents are gone.