Date:      Mon, 15 Oct 2012 07:54:21 -0700
From:      nate keegan <nate.keegan@gmail.com>
To:        freebsd-hardware@freebsd.org
Subject:   Re: ahcich Timeouts SATA SSD
Message-ID:  <CABVjXfceHC3s0u6pMBWcPb1XqTrvVW52FN9G3A1Oh1F-UUVqNQ@mail.gmail.com>
In-Reply-To: <20121015095858.GC33428@server.rulingia.com>
References:  <CABVjXfeV9VvF6sJC3Tb78z=jP%2B2sF%2BOJ2q0euCZkNqN_Yjs9ag@mail.gmail.com> <20121015095858.GC33428@server.rulingia.com>

The system has dual PSUs behind a UPS, so I don't think power is the
issue.

My notes show that we replaced one of the DIMMs on this system a few
months ago after it was flagged as bad during POST.

During the cycle of reboots I have gone through while testing fixes
for this issue, the BIOS detected a bad DIMM exactly once.

I do have a complete set of replacement memory (Crucial, versus the
Kingston that is in the system now) and will swap out the memory in
case one of the DIMMs is flaky but not bad enough for the BIOS to
notice consistently.

I am not able to drop into DDB when the issue happens as the system is
locked up completely. That could be a failure on my part to understand
how to do this; I will try again if the issue recurs (it should on
Wednesday AM, unless setting camcontrol apm to off for the disks
somehow fixes it).

I am running a GENERIC kernel and have not set any loader tunables or
sysctls other than those related to this issue (SATA power management,
AHCI, etc).

The problem first started around the time we set up pool scrubbing,
and at that time it was a single instance which seemed to be tied to
the bad DIMM. I have not run pool scrubbing since.

Will get the output of gstat -a and post it up here.

Will upgrade to FreeBSD 9.1-RC2 today and compile a kernel with the
options you suggested.

I already went ahead and removed the L2ARC and one of the OS SSD
drives to simplify things - now I have 1 x SSD with OS and 1 x SSD for
swap and that is it.

I ran the Crucial firmware update ISO and it did not find any firmware
updates needed for the SSDs.

I appreciate the feedback, as part of the difficulty here has been
determining whether this is software/driver or hardware. If it is
software, I agree it would not make sense for this to suddenly pop up
after months of operation with no issues.


> Are you running a GENERIC kernel?  If not, what changes have you made?
> Have you set any loader tunables or sysctls?
> Have you scrubbed the pools?
> If you run "gstat -a", do any devices have anomalous readings?
>
> I can't offer any definite fixes but can suggest a few more things to
> try:
> 1) Try FreeBSD-9.1RC2 and see if the problem persists.
> 2) Try a new kernel with
>      options WITNESS
>      options WITNESS_SKIPSPIN
>    this may make a software bug more obvious (but will somewhat increase
>    kernel overheads)
> 3) If you can afford it, detach the L2ARC - which removes one potential issue.
> 4) If you haven't already, build a kernel with
>      makeoptions DEBUG=-g
>      options KDB
>      options KDB_TRACE
>      options KDB_UNATTENDED
>      options DDB
>    this won't have any impact on normal operation but will simplify debugging.
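
For reference, the options from items 2 and 4 above could be combined
into one custom kernel config roughly as follows. This is only a
sketch: the config name DEBUGKERN, the amd64 path, and inheriting from
GENERIC via "include" are assumptions and not something stated in this
thread. Some of these options may already be present in GENERIC on a
given release.

```
# Hypothetical file: /usr/src/sys/amd64/conf/DEBUGKERN
# (name and architecture are assumptions; adjust to the actual system)

include         GENERIC         # start from the stock GENERIC config
ident           DEBUGKERN       # hypothetical name for this kernel

# Item 4: debugging support, to allow dropping into DDB
makeoptions     DEBUG=-g        # build with debugging symbols
options         KDB             # kernel debugger framework
options         KDB_TRACE       # print a stack trace on panic
options         KDB_UNATTENDED  # do not wait at the debugger on panic
options         DDB             # interactive in-kernel debugger

# Item 2: lock-order checking (adds some kernel overhead)
options         WITNESS
options         WITNESS_SKIPSPIN

# Typical build and install, run from /usr/src:
#   make buildkernel KERNCONF=DEBUGKERN
#   make installkernel KERNCONF=DEBUGKERN
```

Since WITNESS is the only part expected to add overhead, it could be
left out first if the goal is just to get into DDB after a lockup.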


