Date:      Tue, 05 Jul 2011 17:41:11 +0200
From:      Christian Baer <christian.baer@uni-dortmund.de>
To:        freebsd-stable@freebsd.org
Subject:   Re: Crashes with Promise controller
Message-ID:  <iuvbao$l84$1@dough.gmane.org>
In-Reply-To: <20110618175215.GA18645@icarus.home.lan>
References:  <it56el$tqa$1@dough.gmane.org>	<52F39CE0-EEC7-4180-8186-BF8696AF279D@lassitu.de> <20110618175215.GA18645@icarus.home.lan>

On 18.06.2011 19:52, Jeremy Chadwick wrote:

> It may be that the kernel is panic'ing and auto-rebooting before he can
> see the message in question.  I would advocate he put the following
> directives in his kernel configuration and rebuild/reinstall kernel and
> wait for it to happen again.

I have now changed the power setup slightly, and the problems have
*reduced* and changed somewhat in character. Reproducing a panic is now a
lot harder, which I consider a good thing at the moment.

Since I changed the power configuration, the system has been running for
about 4 days and has had only two crashes (traps), despite quite heavy
traffic on the drives. Because the system used to reboot very quickly
and I had not yet set up the serial console, I only ever got to see one
panic (not a trap) in the past, and it disappeared too quickly for me to
write anything down.

On a side note:
During my testing (before changing the power setup) I found out that two
drives were actually causing the problems, and I could even make the
system crash while only reading from one of those drives. Crashes while
reading felt less frequent (I collected no statistics, though), but they
happened just the same.

Because I had formatted the two drives in question with rather unusual
values (fairly large block sizes), I decided to copy everything off
them, re-partition them with GPT, and recreate both the encryption layer
and the file system on them from scratch.

While copying, I managed to crash the system twice. The first time
was yesterday, when I got this:

--- snip ---
Fatal trap 12: page fault while in kernel mode
fault virtual address   = 0x1f8
fault code              = supervisor read, page not present
instruction pointer     = 0x20:0xc3d2120c
stack pointer           = 0x28:0xc3697bf4
frame pointer           = 0x28:0xc3697c4c
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 2 (g_event)
[thread pid 2 tid 100007 ]
Stopped at      g_eli_access+0x7c:      testl   $0x10008,0x1f8(%ebx)
--- snap ---

About 25 minutes ago, the system crashed again. This time, I had the
"known" errors prior to the actual trap:

--- snip ---
ata6: SIGNATURE: ffffffff
ata6: timeout waiting to issue command
ata6: error issuing SETFEATURES SET TRANSFER MODE command
ata6: timeout waiting to issue command
ata6: error issuing SETFEATURES ENABLE RCACHE command
ata6: timeout waiting to issue command
ata6: error issuing SETFEATURES ENABLE WCACHE command
ata6: timeout waiting to issue command
ata6: error issuing SET_MULTI command
ad12: FAILURE - device detached
GEOM_ELI: g_eli_read_done() failed ad12d.eli[READ(offset=403810975744,
length=32768)]
g_vfs_done():ad12d.eli[READ(offset=403810975744, length=32768)]error = 6

Fatal trap 12: page fault while in kernel mode
fault virtual address   = 0x1f8
fault code              = supervisor read, page not present
instruction pointer     = 0x20:0xc3d2420c
stack pointer           = 0x28:0xc3697bf4
frame pointer           = 0x28:0xc3697c4c
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 2 (g_event)
[thread pid 2 tid 100007 ]
Stopped at      g_eli_access+0x7c:      testl   $0x10008,0x1f8(%ebx)
--- snap ---

The strange thing is that I wasn't actually accessing ad12 at the time.
The only thing running on it was a "-t long" test, nothing more, and that
test had been running for over two hours when the system crashed.

Does this still point to a power problem (since ad12 apparently gets
detached)? Or could it be something more fundamental?

Best regards,
Chris



