Date: Fri, 09 Nov 2012 05:26:05 -0500
From: Lanny Baron <lnb@freebsdsystems.com>
To: nate keegan <nate.keegan@gmail.com>
Cc: freebsd-hardware@freebsd.org
Subject: Re: ahcich Timeouts SATA SSD
Message-ID: <509CDA3D.5090509@freebsdsystems.com>
In-Reply-To: <CABVjXfePQvNs8NZnUgO5ZCBT0dAcn1SfkihtCE1wQjwou-Oj7A@mail.gmail.com>
References: <20121015203229.40280@gmx.com> <CABVjXfePQvNs8NZnUgO5ZCBT0dAcn1SfkihtCE1wQjwou-Oj7A@mail.gmail.com>

Hi,

I don't know how much time passed between when you bought/built your
server and when you added memory. I ask because the DRAM chips on the
new modules might be slightly different. When we build servers, we use a
particular brand for several reasons, one of which is that the DRAM
specs do not change on a given SKU.

Here is what I recommend you try. Take out all the memory. Add one DIMM
only. See if the problem persists. If the problem stops, add a second
DIMM. If it is still good, add a third DIMM, and keep adding DIMMs one
by one. When the problem comes back, remove all the DIMMs again and put
the last DIMM you added (the one where the problem came back) in the
first slot. If the problem persists, you have found the winner. If not,
add all the DIMMs back except the one you just tested, and use another
DIMM in its place. If the problem persists, you have found a bad memory
slot.

It's a real PITA <tm>, but that is the only way to find the issue if it
is indeed memory or a bad memory slot.

One more thing you should try. Did you enable IPMI? If so:

# ipmitool -H x.x.x.x sel list

Take a look at the output. If you did not enable IPMI (IP
address/netmask/gateway), the BIOS should have a place to do so. Sorry,
we don't sell/build Supermicro* so I am unfamiliar with those boards.

If you are using both Kingston and Crucial, just use one of those; do
not mix them.

Hope this can help you out.

Lanny
Servaris Corporation
http://www.servaris.com

On 10/16/2012 3:48 PM, nate keegan wrote:
> I'm only seeing gstat output of a few percentage points for the OS disks.
>
> I am using ECC memory (both the Kingston and the new Crucial memory)
> and went ahead and swapped out the SSD for SATA disks this morning.
>
> Since both SSD were the same firmware and type/manufacturer I figured
> it was a good time to address this variable.
>
> I also went ahead and put in a serial console server this morning so I
> have proper console access instead of relying on the Supermicro iLO
> utility.
>
> Will keep an eye on the pure SATA setup to see if it barfs or not.
> Will try to gather some ddb(4) information if it does barf again.
>
>
> On Mon, Oct 15, 2012 at 1:32 PM, Dieter BSD <dieterbsd@engineer.com> wrote:
>>> SSD are connected to on-board SATA port on motherboard
>>
>> Presumably to controllers provided by the Intel Tylersburg 5520 chipset.
>>
>>> This system was commissioned in February of 2012 and ran without issue
>>> as a ZFS backup system on our network until about 3 weeks ago.
>>
>>> The system is dual PSU behind a UPS so I don't think that this is an issue.
>>
>> No changes? e.g. no added hardware to increase power load.
>> Overloading the power supply and/or the wiring (with too many splitters)
>> can result in flaky problems like this.
>>
>>> OS will respond to ping requests after the issue and if you have an
>>> active SSH session you will remain connected to the system until you
>>> attempt to do something like 'ls', 'ps', etc.
>>
>>> I am not able to drop into DDB when the issue happens as the system is
>>> locked up completely. Could be a failure on my part to
>>> understand/engage in how to do this, will try if the issue happens
>>> again (should on Wednesday AM unless setting camcontrol apm to off for
>>> the disks somehow fixes the issue).
>>
>> If the system is alive enough to respond to ping, I'd expect you
>> should be able to get into DDB? Can you get into DDB when the system
>> is working normally?
>>
>>> 2 x Crucial M4 64 Gb SATA SSD for FreeBSD OS (zroot)
>>> 2 x Intel 320 MLC 80 Gb SATA SSD for L2ARC and swap
>>
>>> I ran the Crucial firmware update ISO and it did not see any firmware
>>> updates as necessary on the SSD disks.
>>
>> Does the problem happen with both the Crucial and the Intel SSDs?
>>
>>> If software I agree that it would not make sense that this would
>>> suddenly pop-up after months of operation with no issues.
>>
>> If something causes the software/firmware to take a different
>> path, new issues can appear. E.g. error handling or even timing.
>> Infrequently used code paths might not have been tested sufficiently.
>>
>> Does the controller have firmware? Part of the BIOS I suppose.
>> Is there a BIOS update available? Have you considered connecting the
>> SSDs to a different controller?
>>
>>> the on-board AHCI portion of the BIOS does
>>> not always see the disks after the event without a hard system power
>>> reset.
>>
>> That's at least one bug somewhere, probably the hardware isn't getting reset
>> properly. Does Supermicro know about this bug?
>>
>>> I have 48 Gb of Crucial memory that I will put in this system today to
>>> replace the 24 Gb or so of Kingston memory I have in the system.
>>
>> Which in addition to being different memory, should reduce swap activity.
>>
>> Suggestion: move everything to conventional drives. Keep at least one
>> SSD connected to system, but normally unused. Now you can beat on the
>> SSD in a controlled manner to debug the problem. Does reading trigger
>> the problem? Writing? Try dd with different blocksizes, accessing
>> multiple SSDs at once, etc. I have to wonder if there is a timing problem,
>> or missing interrupt, or...
>>
>>> * Ditch FreeBSD for Solaris so I can keep ZFS lovin for the intended
>>> purpose of this system
>>
>> If it fails with FreeBSD but works with Solaris on the same hardware,
>> then it is almost certainly a problem with the device driver. (Or
>> at least a problem that Solaris has a workaround for.)
>> _______________________________________________
>> freebsd-hardware@freebsd.org mailing list
>> http://lists.freebsd.org/mailman/listinfo/freebsd-hardware
>> To unsubscribe, send any mail to "freebsd-hardware-unsubscribe@freebsd.org"
> _______________________________________________
> freebsd-hardware@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-hardware
> To unsubscribe, send any mail to "freebsd-hardware-unsubscribe@freebsd.org"
>
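
For anyone who wants to keep track of the one-DIMM-at-a-time procedure described at the top of this message, here is a rough sketch of its decision logic in Python. Everything named here is hypothetical and not part of any real tool: `isolate_fault`, the `spare` module, and the `is_stable` callback, which stands in for the manual step of seating the DIMMs, booting, and running long enough to see whether the timeouts come back. It also assumes slots are filled in order starting from slot 0, which may not match a given board's population rules.

```python
from typing import Callable, Dict, List, Optional, Tuple, Union

# A configuration maps a slot index to the label of the DIMM seated in it.
Config = Dict[int, str]
Verdict = Tuple[str, Union[str, int]]


def isolate_fault(
    dimms: List[str],
    spare: str,
    is_stable: Callable[[Config], bool],
) -> Optional[Verdict]:
    """Walk the add-one-DIMM-at-a-time procedure.

    Returns ("dimm", label), ("slot", index), ("inconclusive", index),
    or None if the machine stayed stable with every DIMM installed.
    `is_stable` is a hypothetical stand-in for a manual boot-and-soak test.
    """
    config: Config = {}
    for slot, dimm in enumerate(dimms):
        config[slot] = dimm  # seat one more DIMM in the next slot
        if is_stable(config):
            continue  # still good, keep adding

        # The problem came back. Test the last DIMM added, alone, in the
        # first slot. (If it failed while already alone in slot 0, this
        # repeats the same test and cannot tell DIMM from slot.)
        if not is_stable({0: dimm}):
            return ("dimm", dimm)

        # The DIMM is fine on its own: restore the others and seat a
        # different DIMM in the slot where the problem appeared.
        swapped = dict(config)
        swapped[slot] = spare
        if not is_stable(swapped):
            return ("slot", slot)

        # Neither the DIMM alone nor the slot alone reproduces the fault.
        return ("inconclusive", slot)
    return None


if __name__ == "__main__":
    # Simulated machine in which the DIMM labelled "C" is faulty.
    def bad_dimm_c(config: Config) -> bool:
        return "C" not in config.values()

    print(isolate_fault(["A", "B", "C", "D"], "S", bad_dimm_c))  # ('dimm', 'C')
```

The "inconclusive" result covers the case the procedure does not name: the DIMM works alone and the slot works with another module, which would point back toward a mix of parts that do not get along.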