Date:      Mon, 15 Oct 2012 16:32:28 -0400
From:      "Dieter BSD" <dieterbsd@engineer.com>
To:        freebsd-hardware@freebsd.org
Subject:   Re: ahcich Timeouts SATA SSD
Message-ID:  <20121015203229.40280@gmx.com>

> SSD are connected to on-board SATA port on motherboard

Presumably to controllers provided by the Intel Tylersburg 5520 chipset.

> This system was commissioned in February of 2012 and ran without issue
> as a ZFS backup system on our network until about 3 weeks ago.

> The system is dual PSU behind a UPS so I don't think that this is an issue.

No changes? E.g., no added hardware that would increase the power load?
Overloading the power supply and/or the wiring (with too many splitters)
can result in flaky problems like this.

> OS will respond to ping requests after the issue and if you have an
> active SSH session you will remain connected to the system until you
> attempt to do something like 'ls', 'ps', etc.

> I am not able to drop into DDB when the issue happens as the system is
> locked up completely. Could be a failure on my part to
> understand/engage in how to do this, will try if the issue happens
> again (should on Wednesday AM unless setting camcontrol apm to off for
> the disks somehow fixes the issue).

If the system is alive enough to respond to ping, I'd expect that you
could get into DDB. Can you get into DDB when the system
is working normally?

> 2 x Crucial M4 64 Gb SATA SSD for FreeBSD OS (zroot)
> 2 x Intel 320 MLC 80 Gb SATA SSD for L2ARC and swap

> I ran the Crucial firmware update ISO and it did not see any firmware
> updates as necessary on the SSD disks.

Does the problem happen with both the Crucial and the Intel SSDs?

> If software I agree that it would not make sense that this would
> suddenly pop-up after months of operation with no issues.

If something causes the software/firmware to take a different
path, new issues can appear. E.g. error handling or even timing.
Infrequently used code paths might not have been tested sufficiently.

Does the controller have firmware? Part of the BIOS I suppose.
Is there a BIOS update available? Have you considered connecting the
SSDs to a different controller?

> the on-board AHCI portion of the BIOS does
> not always see the disks after the event without a hard system power
> reset.

That's at least one bug somewhere; probably the hardware isn't getting reset
properly. Does Supermicro know about this bug?

> I have 48 Gb of Crucial memory that I will put in this system today to
> replace the 24 Gb or so of Kingston memory I have in the system.

Which, in addition to being different memory, should reduce swap activity.

Suggestion: move everything to conventional drives. Keep at least one
SSD connected to the system, but normally unused. Now you can beat on the
SSD in a controlled manner to debug the problem. Does reading trigger
the problem? Writing? Try dd with different blocksizes, accessing
multiple SSDs at once, etc. I have to wonder if there is a timing problem,
or missing interrupt, or...
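
As a rough illustration of that kind of controlled test, here is a minimal
read-only sketch in Python rather than dd. It reads the same amount of data
from one or more devices at several block sizes, with all devices read
concurrently at each size. The function names, block sizes, and device paths
are made up for illustration and are not from the original report. It only
covers the read side; a write test would destroy data and is left out on
purpose.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_read(path, block_size, total_bytes):
    """Read up to total_bytes from path in block_size chunks.

    Returns (bytes_read, elapsed_seconds). Stops early at end of file.
    """
    done = 0
    start = time.monotonic()
    # buffering=0 gives unbuffered reads, so each read() is one request
    # of (at most) block_size bytes, similar to dd's bs= option.
    with open(path, "rb", buffering=0) as dev:
        while done < total_bytes:
            chunk = dev.read(min(block_size, total_bytes - done))
            if not chunk:
                break
            done += len(chunk)
    return done, time.monotonic() - start


def sweep(paths, block_sizes, total_bytes):
    """For each block size, read all paths at once, one thread per path.

    Returns a dict mapping (path, block_size) to (bytes_read, seconds).
    """
    results = {}
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:
        for bs in block_sizes:
            futures = {p: pool.submit(timed_read, p, bs, total_bytes)
                       for p in paths}
            for p, fut in futures.items():
                results[(p, bs)] = fut.result()
    return results
```

Something like sweep(["/dev/ada2", "/dev/ada3"], [512, 4096, 65536, 1048576],
256 * 1024 * 1024) would exercise two SSDs together (the device names are
hypothetical; substitute your own). Raw disk devices generally want request
sizes that are a multiple of the sector size, so keep both the block sizes and
the total at multiples of 512. If the hang shows up only at certain block
sizes, or only when more than one SSD is busy, that narrows things down.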

> * Ditch FreeBSD for Solaris so I can keep ZFS lovin for the intended
> purpose of this system

If it fails with FreeBSD but works with Solaris on the same hardware,
then it is almost certainly a problem with the device driver. (Or
at least a problem that Solaris has a workaround for.)


