Date:      Tue, 23 Oct 2012 12:45:12 -0700
From:      nate keegan <nate.keegan@gmail.com>
To:        freebsd-hardware@freebsd.org
Subject:   Re: ahcich Timeouts SATA SSD
Message-ID:  <CABVjXffrpqhr-JwJb%2BKjzDGwjGEKFgXZVkrXvsKdRdkeHeL6xw@mail.gmail.com>
In-Reply-To: <CABVjXfePQvNs8NZnUgO5ZCBT0dAcn1SfkihtCE1wQjwou-Oj7A@mail.gmail.com>
References:  <20121015203229.40280@gmx.com> <CABVjXfePQvNs8NZnUgO5ZCBT0dAcn1SfkihtCE1wQjwou-Oj7A@mail.gmail.com>

Since replacing the SSDs with plain old conventional SATA disks in
external enclosures, I have not experienced a single issue.

I can only surmise that something is wonky with the Crucial M4
firmware with FreeBSD 8.2/9.0 under certain circumstances.

Thanks to everyone who contributed to this thread; the information
about debugging kernels, etc., was very helpful from a procedural
point of view.


On Tue, Oct 16, 2012 at 12:48 PM, nate keegan <nate.keegan@gmail.com> wrote:
> I'm only seeing gstat output of a few percentage points for the OS disks.
>
> I am using ECC memory (both the Kingston and the new Crucial memory)
> and went ahead and swapped out the SSD for SATA disks this morning.
>
> Since both SSDs were the same type and manufacturer, with the same
> firmware, I figured it was a good time to address this variable.
>
> I also went ahead and put in a serial console server this morning so I
> have proper console access instead of relying on the Supermicro iLO
> utility.
>
> Will keep an eye on the pure SATA setup to see if it barfs or not.
> Will try to gather some ddb(4) information if it does barf again.
>
>
> On Mon, Oct 15, 2012 at 1:32 PM, Dieter BSD <dieterbsd@engineer.com> wrote:
>>> SSD are connected to on-board SATA port on motherboard
>>
>> Presumably to controllers provided by the Intel Tylersburg 5520 chipset.
>>
>>> This system was commissioned in February of 2012 and ran without issue
>>> as a ZFS backup system on our network until about 3 weeks ago.
>>
>>> The system is dual PSU behind a UPS so I don't think that this is an issue.
>>
>> No changes? e.g. no added hardware to increase power load.
>> Overloading the power supply and/or the wiring (with too many splitters)
>> can result in flaky problems like this.
>>
>>> OS will respond to ping requests after the issue and if you have an
>>> active SSH session you will remain connected to the system until you
>>> attempt to do something like 'ls', 'ps', etc.
>>
>>> I am not able to drop into DDB when the issue happens as the system is
>>> locked up completely. Could be a failure on my part to
>>> understand/engage in how to do this, will try if the issue happens
>>> again (should on Wednesday AM unless setting camcontrol apm to off for
>>> the disks somehow fixes the issue).
>>
>> If the system is alive enough to respond to ping, I'd expect you
>> should be able to get into DDB? Can you get into DDB when the system
>> is working normally?
>>
>>> 2 x Crucial M4 64 GB SATA SSD for FreeBSD OS (zroot)
>>> 2 x Intel 320 MLC 80 GB SATA SSD for L2ARC and swap
>>
>>> I ran the Crucial firmware update ISO and it did not see any firmware
>>> updates as necessary on the SSD disks.
>>
>> Does the problem happen with both the Crucial and the Intel SSDs?
>>
>>> If it is software, I agree that it would not make sense for this to
>>> suddenly pop up after months of operation with no issues.
>>
>> If something causes the software/firmware to take a different
>> path, new issues can appear. E.g. error handling or even timing.
>> Infrequently used code paths might not have been tested sufficiently.
>>
>> Does the controller have firmware? Part of the BIOS I suppose.
>> Is there a BIOS update available? Have you considered connecting the
>> SSDs to a different controller?
>>
>>> the on-board AHCI portion of the BIOS does
>>> not always see the disks after the event without a hard system power
>>> reset.
>>
>> That's at least one bug somewhere, probably the hardware isn't getting reset
>> properly. Does Supermicro know about this bug?
>>
>>> I have 48 GB of Crucial memory that I will put in this system today to
>>> replace the 24 GB or so of Kingston memory I have in the system.
>>
>> Which in addition to being different memory, should reduce swap activity.
>>
>> Suggestion: move everything to conventional drives. Keep at least one
>> SSD connected to system, but normally unused. Now you can beat on the
>> SSD in a controlled manner to debug the problem. Does reading trigger
>> the problem? Writing? Try dd with different blocksizes, accessing
>> multiple SSDs at once, etc. I have to wonder if there is a timing problem,
>> or missing interrupt, or...
>>
>>> * Ditch FreeBSD for Solaris so I can keep ZFS lovin for the intended
>>> purpose of this system
>>
>> If it fails with FreeBSD but works with Solaris on the same hardware,
>> then it is almost certainly a problem with the device driver. (Or
>> at least a problem that Solaris has a workaround for.)
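
For anyone who wants to try the controlled test suggested in the quoted
message (reads at different block sizes, several SSDs at once), here is
a minimal sketch in Python rather than dd itself. It only reads, never
writes. The function names, the block-size list and the 64 MiB read
limit are my own assumptions for illustration, not something from this
thread, and the device paths have to be supplied by you.

```python
#!/usr/bin/env python3
"""Read-only stress sketch, loosely modelled on `dd if=DEV of=/dev/null bs=N`."""
import sys
import threading
import time


def timed_read(path, block_size, max_bytes):
    """Read from path in block_size chunks until max_bytes or EOF.

    Returns (bytes_read, elapsed_seconds).
    """
    total = 0
    start = time.monotonic()
    # buffering=0 opens the file unbuffered, so each read() is one request.
    with open(path, "rb", buffering=0) as dev:
        while total < max_bytes:
            chunk = dev.read(block_size)
            if not chunk:  # end of file or end of device
                break
            total += len(chunk)
    return total, time.monotonic() - start


def sweep(paths, block_sizes, max_bytes):
    """For each block size, read every path concurrently (one thread each)."""
    results = {}
    for bs in block_sizes:
        threads = []
        for path in paths:
            def worker(p=path, b=bs):
                results[(p, b)] = timed_read(p, b, max_bytes)
            t = threading.Thread(target=worker)
            threads.append(t)
            t.start()
        for t in threads:
            t.join()
    return results


if __name__ == "__main__":
    # Paths come from the command line; nothing is hard-coded here.
    targets = sys.argv[1:]
    if not targets:
        print("usage: readtest.py PATH [PATH ...]")
    else:
        # Assumed values: block sizes are multiples of 512, limit is 64 MiB.
        sizes = [512, 4096, 65536, 1048576]
        found = sweep(targets, sizes, 64 * 1024 * 1024)
        for (path, bs), (nbytes, secs) in sorted(found.items()):
            print(f"{path} bs={bs}: {nbytes} bytes in {secs:.2f}s")
```

Run it with one path to test a single disk, or with several paths to
hit multiple disks at the same time, while watching the console for
ahcich timeout messages. A write test would need a scratch disk and is
deliberately left out.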


