Date:      Tue, 16 Oct 2012 12:48:16 -0700
From:      nate keegan <nate.keegan@gmail.com>
To:        freebsd-hardware@freebsd.org
Subject:   Re: ahcich Timeouts SATA SSD
Message-ID:  <CABVjXfePQvNs8NZnUgO5ZCBT0dAcn1SfkihtCE1wQjwou-Oj7A@mail.gmail.com>
In-Reply-To: <20121015203229.40280@gmx.com>
References:  <20121015203229.40280@gmx.com>

gstat shows the OS disks at only a few percent busy.

I am using ECC memory (both the Kingston and the new Crucial memory)
and went ahead and swapped out the SSDs for conventional SATA disks
this morning.

Since both SSDs had the same firmware, type, and manufacturer, I figured
it was a good time to eliminate this variable.

I also went ahead and put in a serial console server this morning so I
have proper console access instead of relying on the Supermicro iLO
utility.

Will keep an eye on the pure SATA setup to see if it barfs or not.
Will try to gather some ddb(4) information if it does barf again.


On Mon, Oct 15, 2012 at 1:32 PM, Dieter BSD <dieterbsd@engineer.com> wrote:
>> SSD are connected to on-board SATA port on motherboard
>
> Presumably to controllers provided by the Intel Tylersburg 5520 chipset.
>
>> This system was commissioned in February of 2012 and ran without issue
>> as a ZFS backup system on our network until about 3 weeks ago.
>
>> The system is dual PSU behind a UPS so I don't think that this is an issue.
>
> No changes? e.g. no added hardware to increase power load.
> Overloading the power supply and/or the wiring (with too many splitters)
> can result in flaky problems like this.
>
>> OS will respond to ping requests after the issue and if you have an
>> active SSH session you will remain connected to the system until you
>> attempt to do something like 'ls', 'ps', etc.
>
>> I am not able to drop into DDB when the issue happens as the system is
>> locked up completely. Could be a failure on my part to
>> understand/engage in how to do this, will try if the issue happens
>> again (should on Wednesday AM unless setting camcontrol apm to off for
>> the disks somehow fixes the issue).
>
> If the system is alive enough to respond to ping, I'd expect you
> should be able to get into DDB? Can you get into DDB when the system
> is working normally?
>
>> 2 x Crucial M4 64 GB SATA SSD for FreeBSD OS (zroot)
>> 2 x Intel 320 MLC 80 GB SATA SSD for L2ARC and swap
>
>> I ran the Crucial firmware update ISO and it did not see any firmware
>> updates as necessary on the SSD disks.
>
> Does the problem happen with both the Crucial and the Intel SSDs?
>
>> If it is software, I agree that it would not make sense for this to
>> suddenly pop up after months of operation with no issues.
>
> If something causes the software/firmware to take a different
> path, new issues can appear. E.g. error handling or even timing.
> Infrequently used code paths might not have been tested sufficiently.
>
> Does the controller have firmware? Part of the BIOS I suppose.
> Is there a BIOS update available? Have you considered connecting the
> SSDs to a different controller?
>
>> the on-board AHCI portion of the BIOS does
>> not always see the disks after the event without a hard system power
>> reset.
>
> That's at least one bug somewhere, probably the hardware isn't getting reset
> properly. Does Supermicro know about this bug?
>
>> I have 48 GB of Crucial memory that I will put in this system today to
>> replace the 24 GB or so of Kingston memory I have in the system.
>
> Which in addition to being different memory, should reduce swap activity.
>
> Suggestion: move everything to conventional drives. Keep at least one
> SSD connected to system, but normally unused. Now you can beat on the
> SSD in a controlled manner to debug the problem. Does reading trigger
> the problem? Writing? Try dd with different blocksizes, accessing
> multiple SSDs at once, etc. I have to wonder if there is a timing problem,
> or missing interrupt, or...
>
>> * Ditch FreeBSD for Solaris so I can keep ZFS lovin for the intended
>> purpose of this system
>
> If it fails with FreeBSD but works with Solaris on the same hardware,
> then it is almost certainly a problem with the device driver. (Or
> at least a problem that Solaris has a workaround for.)
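
The suggestion above to beat on a spare SSD in a controlled manner with
different block sizes could be sketched roughly as follows. This is only a
read-only illustration in Python, not what either poster actually ran; the
function names (timed_read, sweep), the block sizes, and the 64 MiB limit
are all assumptions chosen for the example. It only reads, so it does not
cover the "Writing?" half of the question.

```python
import time


def timed_read(path, block_size, max_bytes):
    """Read up to max_bytes from path in block_size chunks.

    Returns (bytes_read, elapsed_seconds). Hypothetical helper for this sketch.
    """
    total = 0
    start = time.monotonic()
    # buffering=0 opens an unbuffered binary file, so every read() below
    # is passed to the OS with the requested size.
    with open(path, "rb", buffering=0) as f:
        while total < max_bytes:
            chunk = f.read(min(block_size, max_bytes - total))
            if not chunk:
                break  # end of file or device
            total += len(chunk)
    return total, time.monotonic() - start


def sweep(path, block_sizes=(512, 4096, 65536, 1048576), max_bytes=64 * 1048576):
    """Run timed_read once per block size; returns {block_size: (bytes, seconds)}.

    The default max_bytes is a multiple of every default block size, so
    each read stays a whole number of blocks (assumed to matter for raw
    disk devices).
    """
    return {bs: timed_read(path, bs, max_bytes) for bs in block_sizes}
```

A call such as sweep("/dev/ada2") would exercise one disk (the device name
is a placeholder; substitute the spare SSD). Running several copies at the
same time against different devices would approximate the "multiple SSDs at
once" case, while watching the console for ahcich timeout messages.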


