From owner-freebsd-stable@FreeBSD.ORG  Tue Jul  5 15:41:31 2011
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 4608D106566B
	for <freebsd-stable@freebsd.org>; Tue,  5 Jul 2011 15:41:30 +0000 (UTC)
	(envelope-from freebsd-stable@m.gmane.org)
Received: from lo.gmane.org (lo.gmane.org [80.91.229.12])
	by mx1.freebsd.org (Postfix) with ESMTP id CAF7D8FC13
	for <freebsd-stable@freebsd.org>; Tue,  5 Jul 2011 15:41:29 +0000 (UTC)
Received: from list by lo.gmane.org with local (Exim 4.69)
	(envelope-from <freebsd-stable@m.gmane.org>) id 1Qe7kb-0006oK-Vl
	for freebsd-stable@freebsd.org; Tue, 05 Jul 2011 17:41:25 +0200
Received: from dtmd-4db2d4d7.pool.mediaways.net ([77.178.212.215])
	by main.gmane.org with esmtp (Gmexim 0.1 (Debian))
	id 1AlnuQ-0007hv-00
	for <freebsd-stable@freebsd.org>; Tue, 05 Jul 2011 17:41:25 +0200
Received: from christian.baer by dtmd-4db2d4d7.pool.mediaways.net with local
	(Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00
	for <freebsd-stable@freebsd.org>; Tue, 05 Jul 2011 17:41:25 +0200
X-Injected-Via-Gmane: http://gmane.org/
To: freebsd-stable@freebsd.org
From: Christian Baer <christian.baer@uni-dortmund.de>
Date: Tue, 05 Jul 2011 17:41:11 +0200
Lines: 89
Message-ID: <iuvbao$l84$1@dough.gmane.org>
References: <it56el$tqa$1@dough.gmane.org>	<52F39CE0-EEC7-4180-8186-BF8696AF279D@lassitu.de>
	<20110618175215.GA18645@icarus.home.lan>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
X-Complaints-To: usenet@dough.gmane.org
X-Gmane-NNTP-Posting-Host: dtmd-4db2d4d7.pool.mediaways.net
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US;
	rv:1.9.1.16) Gecko/20101125 Lightning/1.0b1 Thunderbird/3.0.11
In-Reply-To: <20110618175215.GA18645@icarus.home.lan>
Subject: Re: Crashes with Promise controller
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>, 
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 05 Jul 2011 15:41:31 -0000

On 18.06.2011 19:52, Jeremy Chadwick wrote:

> It may be that the kernel is panic'ing and auto-rebooting before he can
> see the message in question.  I would advocate he put the following
> directives in his kernel configuration and rebuild/reinstall kernel and
> wait for it to happen again.

I have now changed the power setup slightly, and the problems have
*decreased* and also changed somewhat in character. A panic is now a lot
harder to reproduce, which I consider a good thing at the moment.

Since I changed the power configuration, the system has been running for
about 4 days and has had only two crashes (traps), despite quite heavy
traffic on the drives. Before I set up the serial console, the system
rebooted so quickly that I only ever got to see one panic (as opposed to
a trap), and it was gone too quickly for me to write anything down.

On a side note:
During my testing (before changing the power setup), I found that two
specific drives were causing the problems, and I could even crash the
system while only reading from one of them. Crashes during reads felt
less frequent (I collected no statistics, though), but they happened
just the same.

Because I had formatted the two drives in question with rather unusual
values (fairly large block sizes), I decided to copy everything off
them, re-partition them with GPT, and recreate both the encryption
layer and the file system on them.

While copying, I managed to crash the system twice. The first crash
happened yesterday, and I got this:

--- snip ---
Fatal trap 12: page fault while in kernel mode
fault virtual address   = 0x1f8
fault code              = supervisor read, page not present
instruction pointer     = 0x20:0xc3d2120c
stack pointer           = 0x28:0xc3697bf4
frame pointer           = 0x28:0xc3697c4c
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 2 (g_event)
[thread pid 2 tid 100007 ]
Stopped at      g_eli_access+0x7c:      testl   $0x10008,0x1f8(%ebx)
--- snap ---

About 25 minutes ago, the system crashed again. This time, the "known"
errors appeared before the actual trap:

--- snip ---
ata6: SIGNATURE: ffffffff
ata6: timeout waiting to issue command
ata6: error issuing SETFEATURES SET TRANSFER MODE command
ata6: timeout waiting to issue command
ata6: error issuing SETFEATURES ENABLE RCACHE command
ata6: timeout waiting to issue command
ata6: error issuing SETFEATURES ENABLE WCACHE command
ata6: timeout waiting to issue command
ata6: error issuing SET_MULTI command
ad12: FAILURE - device detached
GEOM_ELI: g_eli_read_done() failed ad12d.eli[READ(offset=403810975744,
length=32768)]
g_vfs_done():ad12d.eli[READ(offset=403810975744, length=32768)]error = 6

Fatal trap 12: page fault while in kernel mode
fault virtual address   = 0x1f8
fault code              = supervisor read, page not present
instruction pointer     = 0x20:0xc3d2420c
stack pointer           = 0x28:0xc3697bf4
frame pointer           = 0x28:0xc3697c4c
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 2 (g_event)
[thread pid 2 tid 100007 ]
Stopped at      g_eli_access+0x7c:      testl   $0x10008,0x1f8(%ebx)
--- snap ---
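Both traps stop at the same instruction, and the reported fault address can be checked against that instruction with a little arithmetic. The sketch below is an illustration under stated assumptions, not a diagnosis. It assumes 32-bit i386 addressing (suggested by %ebx and the 0x20/0x28 segment selectors), and the names STOPPED_AT, FAULT_VA and base_register_value are hypothetical. It pulls the displacement out of the AT&T-syntax `disp(%reg)` operand and subtracts it from the fault virtual address to recover the base register value at the time of the fault.

```python
import re

# Values copied verbatim from the trap output above.
STOPPED_AT = "Stopped at      g_eli_access+0x7c:      testl   $0x10008,0x1f8(%ebx)"
FAULT_VA = 0x1F8  # "fault virtual address = 0x1f8"


def base_register_value(stopped_line: str, fault_va: int) -> tuple[str, int]:
    """Hypothetical helper: recover the base register of a disp(%reg) operand.

    Assumes the effective address is reg + disp, wrapping at 2**32 (i386).
    """
    match = re.search(r"(0x[0-9a-fA-F]+)\(%(\w+)\)", stopped_line)
    if match is None:
        raise ValueError("no disp(%reg) memory operand found")
    disp = int(match.group(1), 16)
    reg = match.group(2)
    # fault_va = reg + disp (mod 2**32), so reg = fault_va - disp (mod 2**32).
    return reg, (fault_va - disp) & 0xFFFFFFFF


reg, value = base_register_value(STOPPED_AT, FAULT_VA)
print(f"%{reg} = {value:#x}")
```

Since the fault address equals the displacement, this prints `%ebx = 0x0`, i.e. the pointer that g_eli_access was reading through appears to have been NULL at that moment.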

The strange thing is that I wasn't actually accessing ad12 at the time.
The only thing running on it was a "-t long" self-test, and that test
had been running for over two hours when the crash happened.

Does this still point to a power problem (since ad12 apparently gets
detached)? Or could it be something more fundamental?

Best regards,
Chris