From: Tim Kientzle
To: Andrew Turner
Cc: arm@freebsd.org
Subject: Re: Trashed registers returning from kernel?
Date: Wed, 24 Oct 2012 22:56:22 -0700
List-Id: Porting FreeBSD to the StrongARM Processor

On Oct 24, 2012, at 3:38 AM, Andrew Turner wrote:

> On Sun, 21 Oct 2012 18:40:08 -0700
> Tim Kientzle wrote:
>
>> On the BeagleBone, I'm seeing a similar crash in several different
>> user land programs. I suspect it's a kernel bug.
>>
>> Symptom: program is killed with SIGSEGV. Most of the registers
>> contain values above 0xc0000000 (pointing into kernel space).
>>
>> Theory:
>>  * Registers are not always getting correctly restored on a
>>    kernel->user transition.
>>  * SEGV is a consequence.
>>
>> I can reproduce it semi-consistently by running "emacs existing-file"
>> just after a reboot. (But I'm pretty sure this is the same symptoms
>> I've seen with several other programs, so I don't think it's a bug in
>> emacs.)
>>
>> Has anyone else seen this on an armv6 system?
>>
>> Does anyone have suggestions for how to go about debugging this?
>>
>> Suggestions appreciated.
>
> Can you find if the crash happens after a single syscall or is it
> after many different sys calls?

I've not managed to reproduce it running under ktrace.

There are a few consistencies that make me suspect it's a
single syscall. (In emacs, it always happens just after
saving a file.) But it's maddeningly infrequent, so I don't
think it's a consistent bug in a particular syscall. Rather,
some occasional combination of events is leading to a
botched return to userland.

> How consistent are the register values
> and instruction that causes the SEGV?

There are a few consistencies in the registers. Don't know
the instruction, though, because the PC is trashed, too.
Sometimes the PC is null, sometimes it's pointing to a
structure in kernel space. I don't know the kernel code
well enough to guess what the structure is, though.

> Have you identified any other programs that have the same issue?

In emacs, it always happens just after saving a file.
Some of the registers contain addresses in witness and
mtx_assert code, and I've seen similar values in a core
dump I got from svn, so I think it affects svn as well.

A while back I was seeing occasional crashes in install(1).
In that case, the PC was always pointing to code just after
a call to fchflags(). Unfortunately, I don't have any of those
core dumps handy right now so can't look at the registers
and see if there are any things in common.
Also, the bug I'm seeing in emacs and svn is trashing the
PC as well, so that's a little suspect.

> The relevant code to save the registers with system calls is in
> sys/arm/arm/exception.S and sys/arm/include/asmacros.h.
>
> In exception.S there is the function swi_entry. It:
>  - Saves the registers to the stack.
>  - Stores sp in r0 to be passed in as the argument to swi_handler()
>  - Stores sp in r6 to allow us to restore it later
>  - Aligns the stack
>  - Calls swi_handler() to perform the system call
>  - Restores the stack pointer from r6
>  - Performs any asynchronous software trap (calls ast() if required)
>  - Restores the registers from the stack
>  - Returns to userland
>
> Assuming it is a syscall causing this I can think of 3 possible causes:
>  1. Someone is clobbering the stack.
>  2. Someone is clobbering the trap frame.
>  3. There is a cache issue causing old data to be written to the stack.
>
> Checking 1 should be easy. In exception.S add the instruction "sub sp,
> sp, #32" before the bic instruction. This will add padding to the
> stack. You may need to change the #32 if it is not large enough. This
> won't help if the issue is in ast().

Thanks, Andrew. You've given me a lot to work with.

One interesting observation: I haven't seen a kernel panic
on BeagleBone in quite a while. So the trap frame seems
a little more likely; I would expect a stack or cache issue to
sometimes panic the kernel.

Will take me a couple of weeks probably to follow up on this.
(Not a lot of spare time after $DAYJOB.)

Tim