From owner-freebsd-arm@FreeBSD.ORG  Thu Oct 25 05:56:26 2012
Return-Path: <owner-freebsd-arm@FreeBSD.ORG>
Delivered-To: arm@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
 by hub.freebsd.org (Postfix) with ESMTP id 42CC5146
 for <arm@freebsd.org>; Thu, 25 Oct 2012 05:56:26 +0000 (UTC)
 (envelope-from kientzle@freebsd.org)
Received: from monday.kientzle.com (99-115-135-74.uvs.sntcca.sbcglobal.net
 [99.115.135.74])
 by mx1.freebsd.org (Postfix) with ESMTP id 19CC58FC0A
 for <arm@freebsd.org>; Thu, 25 Oct 2012 05:56:25 +0000 (UTC)
Received: (from root@localhost)
 by monday.kientzle.com (8.14.4/8.14.4) id q9P5uPPc034272;
 Thu, 25 Oct 2012 05:56:25 GMT (envelope-from kientzle@freebsd.org)
Received: from [192.168.2.143] (CiscoE3000 [192.168.1.65])
 by kientzle.com with SMTP id t8keecbvjpvwqz7eyxdmhrqm96;
 Thu, 25 Oct 2012 05:56:24 +0000 (UTC)
 (envelope-from kientzle@freebsd.org)
Subject: Re: Trashed registers returning from kernel?
Mime-Version: 1.0 (Apple Message framework v1283)
Content-Type: text/plain; charset=us-ascii
From: Tim Kientzle <kientzle@freebsd.org>
In-Reply-To: <20121024233812.0eefd07f@fubar.geek.nz>
Date: Wed, 24 Oct 2012 22:56:22 -0700
Content-Transfer-Encoding: 7bit
Message-Id: <76909EA6-8373-4CF0-9F12-2FA7BBDC9722@freebsd.org>
References: <2B1CF099-50F0-46BE-8B02-61309DF93D5F@freebsd.org>
 <20121024233812.0eefd07f@fubar.geek.nz>
To: Andrew Turner <andrew@fubar.geek.nz>
X-Mailer: Apple Mail (2.1283)
Cc: arm@freebsd.org
X-BeenThere: freebsd-arm@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Porting FreeBSD to the StrongARM Processor <freebsd-arm.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arm>,
 <mailto:freebsd-arm-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arm>
List-Post: <mailto:freebsd-arm@freebsd.org>
List-Help: <mailto:freebsd-arm-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arm>,
 <mailto:freebsd-arm-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 25 Oct 2012 05:56:26 -0000


On Oct 24, 2012, at 3:38 AM, Andrew Turner wrote:

> On Sun, 21 Oct 2012 18:40:08 -0700
> Tim Kientzle <kientzle@freebsd.org> wrote:
> 
>> On the BeagleBone, I'm seeing a similar crash in several different
>> user land programs.  I suspect it's a kernel bug.
>> 
>> Symptom: program is killed with SIGSEGV.  Most of the registers
>> contain values above 0xc0000000 (pointing into kernel space).
>> 
>> Theory:
>> * Registers are not always getting correctly restored on a
>> kernel->user transition.
>> * SEGV is a consequence.
>> 
>> I can reproduce it semi-consistently by running "emacs existing-file"
>> just after a reboot.  (But I'm pretty sure this is the same symptoms
>> I've seen with several other programs, so I don't think it's a bug in
>> emacs.)
>> 
>> Has anyone else seen this on an armv6 system?
>> 
>> Does anyone have suggestions for how to go about debugging this?
>> 
>> Suggestions appreciated.
> 
> Can you find out whether the crash happens after a single syscall
> or after many different syscalls?

I haven't managed to reproduce it while running under ktrace.
There are a few consistencies that make me suspect
it's a single syscall.  (In emacs, it always happens just after
saving a file.)

But it's maddeningly infrequent, so I don't
think it's a consistent bug in a particular
syscall.  Rather, some occasional combination
of events is leading to a botched return to
userland.

> How consistent are the register values
> and instruction that causes the SEGV?

There are a few consistencies in the registers.

Don't know the instruction, though, because the PC
is trashed, too.  Sometimes the PC is null, sometimes
it's pointing to a structure in kernel space.  I don't
know the kernel code well enough to guess what
the structure is, though.
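
For anyone who wants to sort their own core dumps the same way,
here is a rough sketch of the classification I'm doing by eye.
It assumes the user/kernel split is at 0xc0000000, as in my
original symptom description. The names (classify_pc,
count_kernel_regs, KERNEL_BASE_GUESS) are made up for
illustration and are not from the FreeBSD tree.

```c
#include <stdint.h>

/* Assumption: addresses at or above this value are kernel space. */
#define KERNEL_BASE_GUESS 0xc0000000u

enum pc_class { PC_NULL, PC_KERNEL, PC_USER };

/* Hypothetical helper: classify a saved PC value from a core dump. */
enum pc_class
classify_pc(uint32_t pc)
{
    if (pc == 0)
        return PC_NULL;
    if (pc >= KERNEL_BASE_GUESS)
        return PC_KERNEL;
    return PC_USER;
}

/* Hypothetical helper: count how many of n saved registers hold
 * values that look like kernel-space addresses. */
unsigned
count_kernel_regs(const uint32_t *regs, unsigned n)
{
    unsigned i, count = 0;

    for (i = 0; i < n; i++) {
        if (regs[i] >= KERNEL_BASE_GUESS)
            count++;
    }
    return count;
}
```

A dump where classify_pc() returns PC_NULL or PC_KERNEL and most
registers count as kernel values would match the pattern I'm
describing.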

> Have you identified any other programs that have the same issue?

In emacs, it always happens just after saving a file.
Some of the registers contain addresses in witness
and mtx_assert code, and I've seen similar values in
a core dump I got from svn, so I think it affects svn
as well.

A while back I was seeing occasional crashes in install(1).
In that case, the PC was always pointing to code just
after a call to fchflags().   Unfortunately, I don't have
any of those core dumps handy right now so can't look
at the registers and see if there are any things in common.
Also, the emacs and svn crashes trash the PC as well, which
the install(1) crashes did not, so I'm not certain they are
the same bug.
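
When I do have dumps from two programs side by side, the
comparison I have in mind is roughly the following. This is only
a sketch: it assumes each dump has been reduced to the 16 core
registers (r0-r15), and common_reg_mask and NREGS are names I
made up for illustration.

```c
#include <stdint.h>

#define NREGS 16 /* r0-r15 on 32-bit ARM */

/* Hypothetical helper: return a bitmask with bit i set when
 * register i holds the same value in both dumps. */
uint32_t
common_reg_mask(const uint32_t a[NREGS], const uint32_t b[NREGS])
{
    uint32_t mask = 0;
    int i;

    for (i = 0; i < NREGS; i++) {
        if (a[i] == b[i])
            mask |= (uint32_t)1 << i;
    }
    return mask;
}
```

Registers that match across unrelated programs would be the
interesting ones, since userland values should have little
reason to agree.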


> The relevant code to save the registers with system calls is in
> sys/arm/arm/exception.S and sys/arm/include/asmacros.h.
> 
> In exception.S there is the function swi_entry. It:
> - Saves the registers to the stack.
> - Stores sp in r0 to be passed in as the argument to swi_handler()
> - Stores sp in r6 to allow us to restore it later
> - Aligns the stack
> - Calls swi_handler() to perform the system call
> - Restores the stack pointer from r6
> - Performs any asynchronous software trap (calls ast() if required)
> - Restores the registers from the stack
> - Returns to userland
> 
> Assuming it is a syscall causing this I can think of 3 possible causes:
> 1. Someone is clobbering the stack.
> 2. Someone is clobbering the trap frame.
> 3. There is a cache issue causing old data to be written to the stack.
> 
> Checking 1 should be easy. In exception.S add the instruction "sub sp,
> sp, #32" before the bic instruction. This will add padding to the
> stack. You may need to change the #32 if it is not large enough. This
> won't help if the issue is in ast().
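
One possible extension of the padding idea (my own assumption,
not part of Andrew's suggestion) would be to fill the padding
with a known byte pattern before calling swi_handler() and check
it afterwards. The real thing would have to be done in assembly
in exception.S; the sketch below only models the check in plain
C, and pad_fill, pad_damage, PAD_BYTES and PAD_PATTERN are
hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAD_BYTES   32   /* mirrors the "#32" in the suggestion above */
#define PAD_PATTERN 0xa5 /* arbitrary fill byte, chosen for illustration */

/* Fill the padding area with the known pattern. */
void
pad_fill(uint8_t *pad, size_t len)
{
    memset(pad, PAD_PATTERN, len);
}

/* Return the number of padding bytes that no longer hold the
 * pattern; nonzero means something wrote into the padding. */
size_t
pad_damage(const uint8_t *pad, size_t len)
{
    size_t i, damaged = 0;

    for (i = 0; i < len; i++) {
        if (pad[i] != PAD_PATTERN)
            damaged++;
    }
    return damaged;
}
```

A nonzero result after the handler returns would point at
cause 1 (someone clobbering the stack). A zero result would not
rule it out, since a stray write could land elsewhere or happen
to store the pattern byte.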

Thanks, Andrew.  You've given me a lot to work with.

One interesting observation:  I haven't seen a kernel
panic on BeagleBone in quite a while.  So the trap
frame (your cause 2) seems a little more likely; I would
expect a clobbered stack or a cache issue to panic the
kernel at least occasionally.

It will probably take me a couple of weeks to follow up on
this.  (Not a lot of spare time after $DAYJOB.)

Tim