From owner-freebsd-hardware@FreeBSD.ORG Sun Oct 14 10:29:53 2012 Return-Path: Delivered-To: freebsd-hardware@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id D15E5CC for ; Sun, 14 Oct 2012 10:29:53 +0000 (UTC) (envelope-from litelwang@126.com) Received: from m15-64.126.com (m15-64.126.com [220.181.15.64]) by mx1.freebsd.org (Postfix) with ESMTP id 32E968FC19 for ; Sun, 14 Oct 2012 10:29:51 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=126.com; s=s110527; h=Received:Date:From:To:Subject:Content-Type: MIME-Version:Message-ID; bh=4FZPoYYeqKuRBFXWxT2MBWJWRdW/9hqsjk9T JjBiaNs=; b=OdcUFHF6da6vy9P7H91FN8qPb4tYRioYNuHbaYMe1iLMDK08A4yO 2eeSIP+pNPm7gqdxFhdPDRKKsQVgBgo4qHWYfIwPJHGwXesPGCYibx5HF0XCAIdN 2gM6eCrM5Gyded+Bcc6oC7DLiDiHQCiGFBXC6Ny4u8b2bCrdaP6e3KU= Received: from litelwang$126.com ( [116.238.78.107] ) by ajax-webmail-wmsvr64 (Coremail) ; Sun, 14 Oct 2012 18:28:56 +0800 (CST) X-Originating-IP: [116.238.78.107] Date: Sun, 14 Oct 2012 18:28:56 +0800 (CST) From: LW To: freebsd-hardware@freebsd.org Subject: Install Program can't boot after AMD-A75 onboard-raid is activated X-Priority: 3 X-Mailer: Coremail Webmail Server Version SP_ntes V3.5 build 20120914(19817.4926.4909) Copyright (c) 2002-2012 www.mailtech.cn 126com MIME-Version: 1.0 Message-ID: <284a4ec5.a626.13a5ed1c3c5.Coremail.litelwang@126.com> X-CM-TRANSID: QMqowED5nkPpk3pQBscZAA--.188W X-CM-SenderInfo: polwvzpzdqwqqrswhudrp/1tbitAFICkX9jmkAlwABs+ X-Coremail-Antispam: 1U5529EdanIXcx71UUUUU7vcSsGvfC2KfnxnUU== Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: base64 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 X-BeenThere: freebsd-hardware@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: General discussion of FreeBSD hardware List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 14 Oct 2012 10:29:53 -0000 
Hardware list:
1. Motherboard (1): MSI A75MA-P35, with latest BIOS-UEFI (AMD A75 chipset, supports SATA3, USB3, socket FM1, DDR3, UEFI)
2. CPU (1): AMD A4-3400 (socket FM1)
3. Memory (1): DDR3 1600 4GB
4. Harddisk (2): Seagate SATA3 500GB
5. USB DVD ROM (1)

Problem description:
1. If IDE or AHCI mode is activated, everything is OK and it runs fast.
2. If RAID mode is activated (WITHOUT any RAID defined, or with RAID1 defined), the boot loader (install CD or DVD, 32- or 64-bit version, FreeBSD 9 release) runs but reboots very soon. Only about 3 lines of messages were shown.

I have read some messages on this mailing list but found nothing that helps solve my problem. Thanks!

From owner-freebsd-hardware@FreeBSD.ORG Sun Oct 14 23:03:47 2012 Return-Path: Delivered-To: freebsd-hardware@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 4FA21779 for ; Sun, 14 Oct 2012 23:03:47 +0000 (UTC) (envelope-from nate.keegan@gmail.com) Received: from mail-vb0-f54.google.com (mail-vb0-f54.google.com [209.85.212.54]) by mx1.freebsd.org (Postfix) with ESMTP id 03CE88FC18 for ; Sun, 14 Oct 2012 23:03:46 +0000 (UTC) Received: by mail-vb0-f54.google.com with SMTP id v11so5894760vbm.13 for ; Sun, 14 Oct 2012 16:03:40 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:date:message-id:subject:from:to:content-type; bh=70JEa6++7x9vgyBb1iUjHVNarKo1siY1B85qA0XWc/4=; b=fEoX+HnLwHBtR5AURDbhcFGo7FWmsPM+716992KrJqaTpbQYeOIMU31XxpSMUs0EON 1d3PsywHGW+5QCBkHhlSvdPeTSC9hCEb/mC9rNptMi19W+NPQ6ejHIf0WX+dFYJjeLmj s3cqkg122Wyeg/Su+6wHeh4CiwttG2580CZy5eiHp7eu8wz1JJOE5Eb8p+EIH+gU5EG4 yoPYAIJZP+c/U935Iz6Lzqu+xCM6D1JXtSlj47ItrFBtTizcw1advjQ5qmzkHeTbIR1o
zC3BDjzR7aaGSZAMHlceJzCjp42asjRmezcwg452L2WKuJvfqf6mQ+7oTj0fb57xzRT3 EPaA== MIME-Version: 1.0 Received: by 10.221.2.76 with SMTP id nt12mr5760534vcb.12.1350255819966; Sun, 14 Oct 2012 16:03:39 -0700 (PDT) Received: by 10.58.240.42 with HTTP; Sun, 14 Oct 2012 16:03:39 -0700 (PDT) Date: Sun, 14 Oct 2012 16:03:39 -0700 Message-ID: Subject: ahcich Timeouts SATA SSD From: nate keegan To: freebsd-hardware@freebsd.org Content-Type: text/plain; charset=ISO-8859-1 X-BeenThere: freebsd-hardware@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: General discussion of FreeBSD hardware List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 14 Oct 2012 23:03:47 -0000

I originally posted this to the FreeBSD hardware forum and then on freebsd-questions at the direction of a moderator in the forum.

Based on what I'm seeing for post types on freebsd-questions this might be the best forum for this issue as it looks like some sort of a strange issue or bug between FreeBSD 8.2/9.0 and SATA SSD drives.

My configuration is as follows:

FreeBSD 8.2-RELEASE
Supermicro X8DTi-LN4F (Intel Tylersburg 5520 chipset) motherboard
24 GB system memory
32 x Hitachi Deskstar 5K3000 disks connected to 4 x Intel SASUC8I (LSI 3081E-R) in IT mode
2 x Crucial M4 64 Gb SATA SSD for FreeBSD OS (zroot)
2 x Intel 320 MLC 80 Gb SATA SSD for L2ARC and swap
SSD are connected to on-board SATA port on motherboard

This system was commissioned in February of 2012 and ran without issue as a ZFS backup system on our network until about 3 weeks ago.

At that time I started getting kernel panics due to timeouts to the on-board SATA devices. The only change to the system since it was built was to add an SSD for swap (32 Gb swap device) and this issue did not happen until several months after this was added.

My initial thought was that I might have a bad SSD drive so I swapped out one of the Crucial SSD drives and the problem happened again a few days later.
I then moved to systematically replacing items such as SATA cables, memory, motherboard, etc and the problem continued. For example, I swapped out the 4 SATA cables with brand new SATA cables and waited to see if the problem happened again. Once it did I moved on to replacing the motherboard with an identical motherboard, waited, etc.

I could not find an obvious hardware related explanation for this behavior so about a week and a half ago I did a fresh install of FreeBSD 9.0-RELEASE to move from the ATA driver to the AHCI driver as I found some evidence that this was helpful.

The problem continued with something like this:

ahcich0: Timeout on slot 29 port 0
ahcich0: is 000000000 cs 00000000 ss e0000000 rs e0000000 tfd 40 serr 00000000 cmd 0004df17
ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Timeout on slot 31 port 0
ahcich0: is 00000000 cs 80000000 ss 00000000 rs 80000000 tfd 80 serr 00000000 cmd 0004df17
(ada0:ahcich0:0:0:0): lost device
ahcich0: AHCI reset: device not ready after 3100ms (tfd = 00000080)
ahcich0: Timeout on slot 31 port 0
ahcich0: is 00000000 cs 80000003 ss 800000003 rs 80000003 tfd 80 serr 0000000 cmd 0004df17
(ada0:ahcich0:0:0:0): removing device entry
ahcich0: AHCI reset: device not ready after 31000ms (tfd = 00000080)
ahcich0: Poll timeout on slot 1 port 0
ahcich0: is 00000000 cs 00000002 ss 000000000 rs 0000002 tfd 80 serr 00000000 cmd 004c117

When this happens the only way to recover the system is to hard boot via IPMI (yanking the power vs hitting reset). I cannot say that every time this happens a hard reset is necessary but more often than not a hard reset is necessary as the on-board AHCI portion of the BIOS does not always see the disks after the event without a hard system power reset.

I have done a bunch of Google work on this and have seen the issue appear in FreeNAS and FreeBSD but no clear cut resolution in terms of how to address it or what causes it.
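As a reading aid for the timeout lines above: the low byte of the `tfd` value mirrors the ATA status register (0x80 BSY, 0x40 DRDY, 0x20 DF, 0x08 DRQ, 0x01 ERR) and the next byte is the error register. The sketch below decodes a value copied from such a line. `decode_tfd` is a hypothetical helper name and not part of FreeBSD; it only does arithmetic on the number it is given.

```shell
#!/bin/sh
# decode_tfd HEX: hypothetical helper (not a FreeBSD tool).
# Names the ATA status bits set in the low byte of an AHCI "tfd" value
# and prints the error byte that sits above it.
decode_tfd() {
    val=$((0x$1))
    status=$((val & 0xff))
    error=$(((val >> 8) & 0xff))
    flags=""
    if [ $((status & 0x80)) -ne 0 ]; then flags="$flags BSY"; fi
    if [ $((status & 0x40)) -ne 0 ]; then flags="$flags DRDY"; fi
    if [ $((status & 0x20)) -ne 0 ]; then flags="$flags DF"; fi
    if [ $((status & 0x08)) -ne 0 ]; then flags="$flags DRQ"; fi
    if [ $((status & 0x01)) -ne 0 ]; then flags="$flags ERR"; fi
    if [ -z "$flags" ]; then flags=" none"; fi
    printf 'status=0x%02x error=0x%02x flags:%s\n' "$status" "$error" "$flags"
}

# Values taken from the log excerpt in this message.
decode_tfd 40
decode_tfd 00000080
```

Read this way, `tfd 80` in the later lines would mean the device still reports busy when the reset gives up, while the first `tfd 40` shows a device that claims to be ready yet did not complete the command.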
Some people had a bad SSD, others had to disable NCQ or power management on their SSD, particular brands of SSD (Samsung), etc. Nothing conclusive so far.

At the present time the issue happens every 1-2 hours unless I have the following in my /boot/loader.conf after the ahci_load statement:

ahci_load="YES"

# See ahci(4)
hint.ahcich.0.sata_rev=1
hint.ahcich.1.sata_rev=1
hint.ahcich.2.sata_rev=1
hint.ahcich.3.sata_rev=1
hint.ahcich.0.pm_level=1
hint.ahcich.1.pm_level=1
hint.ahcich.2.pm_level=1
hint.ahcich.3.pm_level=1

I have a script in /usr/local/etc/rc.d which disables NCQ on these drives:

#!/bin/sh

CAMCONTROL=/sbin/camcontrol

$CAMCONTROL tags ada0 -N 1 > /dev/null
$CAMCONTROL tags ada1 -N 1 > /dev/null
$CAMCONTROL tags ada2 -N 1 > /dev/null
$CAMCONTROL tags ada3 -N 1 > /dev/null

exit 0

I went ahead and pulled the Intel SSDs as they were showing ASR and hardware resets which incremented. Removing both of these disks from the system did not change the situation.

The combination of /boot/loader.conf and this script gets me 6 days or so of operation before the issue pops up again. If I remove these two items I get maybe 2 hours before the issue happens again.

Right now I'm down to one OS disk and one swap disk and that is it for SSD disks on the system.

At the last reboot (yesterday) I disabled APM on the disks (ada0 and ada1 at this point) to see if that makes a difference as I found a reference to this being a potential problem.

I'm looking for insight/help on this as I'm about out of options. If there is a way to gather more information when this happens, post up information, etc I'm open to trying it.

What is driving me crazy is that I can't seem to come up with a concrete explanation as to why now and not back when the system was built.

The issue only seems to happen when the system is idle and the SSD drives do not see much action other than to host OS, scripts, etc while the Intel/LSI based drives is where the actual I/O is at.
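The loader.conf hints and the camcontrol calls listed earlier in this message repeat one setting per channel and per disk, so they can be generated in a loop. A minimal sketch, assuming channels are numbered from 0 and disks are named as in the message; `gen_ahci_hints` and `gen_ncq_cmds` are hypothetical names, and both functions only print text without touching /boot/loader.conf or the disks.

```shell
#!/bin/sh
# gen_ahci_hints N: print the per-channel hints used in this message for
# channels 0..N-1. Hypothetical helper; writes to stdout only.
gen_ahci_hints() {
    n=$1
    for key in sata_rev pm_level; do
        i=0
        while [ "$i" -lt "$n" ]; do
            echo "hint.ahcich.$i.$key=1"
            i=$((i + 1))
        done
    done
}

# gen_ncq_cmds DISK...: print (do not run) the camcontrol command that
# limits each named disk to a single tag, as the rc.d script does.
gen_ncq_cmds() {
    for disk in "$@"; do
        echo "/sbin/camcontrol tags $disk -N 1"
    done
}

gen_ahci_hints 4
gen_ncq_cmds ada0 ada1 ada2 ada3
```

One detail worth checking against ahci(4): as far as that manual page describes it, `sata_rev=1` caps the link at 1.5 Gbps, and `pm_level=0` is the default with interface power management disabled, so `pm_level=1` would allow device-initiated power state changes rather than switch power management off.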
The system logs do not show anything prior to event happening and the OS will respond to ping requests after the issue and if you have an active SSH session you will remain connected to the system until you attempt to do something like 'ls', 'ps', etc.

New SSH requests to the system get 'connection refused'.

As far as I can see I have three real options left:

* Hope that someone here knows something I don't
* Ditch SSD for straight SATA disks (plan on doing this next week before next likely happening sometime Wed am) as perhaps there is some odd SATA/SSD interaction with FreeBSD or with controller I'm not aware of (haven't had this happen with plain SATA and FreeBSD before)
* Ditch FreeBSD for Solaris so I can keep ZFS lovin for the intended purpose of this system

I'm open to suggestions, direction, etc to see if I can nail down what is going on and put this issue to bed for not only myself but for anyone else who might run into it in the future.

From owner-freebsd-hardware@FreeBSD.ORG Mon Oct 15 09:59:09 2012 Return-Path: Delivered-To: freebsd-hardware@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id D2A7AA83 for ; Mon, 15 Oct 2012 09:59:09 +0000 (UTC) (envelope-from peter@rulingia.com) Received: from vps.rulingia.com (host-122-100-2-194.octopus.com.au [122.100.2.194]) by mx1.freebsd.org (Postfix) with ESMTP id 7B99D8FC08 for ; Mon, 15 Oct 2012 09:59:08 +0000 (UTC) Received: from server.rulingia.com (c220-239-248-178.belrs5.nsw.optusnet.com.au [220.239.248.178]) by vps.rulingia.com (8.14.5/8.14.5) with ESMTP id q9F9x5xW003269 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Mon, 15 Oct 2012 20:59:06 +1100 (EST) (envelope-from peter@rulingia.com) X-Bogosity: Ham, spamicity=0.000000 Received: from server.rulingia.com (localhost.rulingia.com [127.0.0.1]) by server.rulingia.com (8.14.5/8.14.5) with ESMTP id q9F9wxE2069954 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Mon, 15 Oct 2012 20:59:00 +1100 (EST) (envelope-from peter@server.rulingia.com) Received: (from peter@localhost) by server.rulingia.com (8.14.5/8.14.5/Submit) id q9F9wwvD069888; Mon, 15 Oct 2012 20:58:58 +1100
(EST) (envelope-from peter) Date: Mon, 15 Oct 2012 20:58:58 +1100 From: Peter Jeremy To: nate keegan Subject: Re: ahcich Timeouts SATA SSD Message-ID: <20121015095858.GC33428@server.rulingia.com> References: MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="NMuMz9nt05w80d4+" Content-Disposition: inline In-Reply-To: X-PGP-Key: http://www.rulingia.com/keys/peter.pgp User-Agent: Mutt/1.5.21 (2010-09-15) Cc: freebsd-hardware@freebsd.org X-BeenThere: freebsd-hardware@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: General discussion of FreeBSD hardware List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 15 Oct 2012 09:59:10 -0000 --NMuMz9nt05w80d4+ Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On 2012-Oct-14 16:03:39 -0700, nate keegan wrote: >Based on what I'm seeing for post types on freebsd-questions this >might be the best forum for this issue as it looks like some sort of a >strange issue or bug between FreeBSD 8.2/9.0 and SATA SSD drives. > >This system was commissioned in February of 2012 and ran without issue >as a ZFS backup system on our network until about 3 weeks ago. > >At that time I started getting kernel panics due to timeouts to the >on-board SATA devices. The only change to the system since it was >built was to add an SSD for swap (32 Gb swap device) and this issue >did not happen until several months after this was added. This _does_ sound more like hardware than software - it's difficult to envisage a software bug that does nothing for 6 months and then makes the system hang regularly. Has there been any significant change to the system load, how much data is being transferred, clients, how full the data zpool is, etc that might correlate with the onset of hangs? 
>I then moved to systematically replacing items such as SATA cables,
>memory, motherboard, etc and the problem continued. For example, I
>swapped out the 4 SATA cables with brand new SATA cables and waited to
>see if the problem happened again. Once it did I moved on to replacing
>the motherboard with an identical motherboard, waited, etc.

Have you tried replacing RAM & PSU?

>The system logs do not show anything prior to event happening and the
>OS will respond to ping requests after the issue and if you have an
>active SSH session you will remain connected to the system until you
>attempt to do something like 'ls', 'ps', etc.

This implies that the kernel is still active but the filesystem is deadlocked. Are you able to drop into DDB? Is anything displayed on the kernel?

>New SSH requests to the system get 'connection refused'.

This implies that sshd has died - a filesystem deadlock should result in connection attempts either timing out or just hanging.

>I'm open to suggestions, direction, etc to see if I can nail down what
>is going on and put this issue to bed for not only myself but for
>anyone else who might run into it in the future.

Are you running a GENERIC kernel? If not, what changes have you made?
Have you set any loader tunables or sysctls?
Have you scrubbed the pools?
If you run "gstat -a", do any devices have anomalous readings?

I can't offer any definite fixes but can suggest a few more things to try:
1) Try FreeBSD-9.1RC2 and see if the problem persists.
2) Try a new kernel with
	options WITNESS
	options WITNESS_SKIPSPIN
   this may make a software bug more obvious (but will somewhat increase kernel overheads)
3) If you can afford it, detach the L2ARC - which removes one potential issue.
4) If you haven't already, build a kernel with
	makeoptions DEBUG=-g
	options KDB
	options KDB_TRACE
	options KDB_UNATTENDED
	options DDB
   this won't have any impact on normal operation but will simplify debugging.
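The kernel options from points 2 and 4 above can be collected in one custom kernel configuration. The following is a sketch under stated assumptions, not a tested build recipe: the config name DEBUGSSD is made up, the source tree is assumed to live in /usr/src, and the architecture is assumed to be amd64. The function only writes a text file; the build commands are left as comments.

```shell
#!/bin/sh
# write_debug_kernconf FILE: write a kernel config that includes GENERIC
# and adds the debugging options listed above. DEBUGSSD is a hypothetical
# ident chosen for this example.
write_debug_kernconf() {
    cat > "$1" <<'EOF'
include GENERIC
ident   DEBUGSSD

makeoptions DEBUG=-g
options KDB
options KDB_TRACE
options KDB_UNATTENDED
options DDB
options WITNESS
options WITNESS_SKIPSPIN
EOF
}

# Demonstration: write the config to a scratch file and print it.
conf=$(mktemp)
write_debug_kernconf "$conf"
cat "$conf"
rm -f "$conf"

# On the real system (assumed paths and architecture):
#   write_debug_kernconf /usr/src/sys/amd64/conf/DEBUGSSD
#   cd /usr/src && make buildkernel KERNCONF=DEBUGSSD
#   make installkernel KERNCONF=DEBUGSSD
```

Keeping the WITNESS lines in the same file makes it easy to comment them out later if the extra kernel overhead becomes a problem.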
-- 
Peter Jeremy --NMuMz9nt05w80d4+ Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (FreeBSD) iEYEARECAAYFAlB73mIACgkQ/opHv/APuIcFbwCgs2yVL26Elp00dyJ0subqzyHe qQUAoKAhqJmSZFRPf9RfYTSpO6dNuo5X =IQnL -----END PGP SIGNATURE----- --NMuMz9nt05w80d4+-- From owner-freebsd-hardware@FreeBSD.ORG Mon Oct 15 10:35:13 2012 Return-Path: Delivered-To: freebsd-hardware@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id B061111B for ; Mon, 15 Oct 2012 10:35:13 +0000 (UTC) (envelope-from patpro@patpro.net) Received: from rack.patpro.net (rack.patpro.net [193.30.227.216]) by mx1.freebsd.org (Postfix) with ESMTP id 1F0A38FC12 for ; Mon, 15 Oct 2012 10:35:12 +0000 (UTC) Received: from rack.patpro.net (localhost [127.0.0.1]) by rack.patpro.net (Postfix) with ESMTP id DF46C1CC020; Mon, 15 Oct 2012 12:26:39 +0200 (CEST) X-Virus-Scanned: amavisd-new at patpro.net Received: from amavis-at-patpro.net ([127.0.0.1]) by rack.patpro.net (rack.patpro.net [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id nxLQ9ZamRpft; Mon, 15 Oct 2012 12:26:34 +0200 (CEST) Received: from [127.0.0.1] (localhost [127.0.0.1]) by rack.patpro.net (Postfix) with ESMTP; Mon, 15 Oct 2012 12:26:34 +0200 (CEST) Subject: Re: ahcich Timeouts SATA SSD Mime-Version: 1.0 (Apple Message framework v1085) Content-Type: multipart/signed; boundary=Apple-Mail-102-474922847; protocol="application/pkcs7-signature"; micalg=sha1 From: Patrick Proniewski In-Reply-To: <20121015095858.GC33428@server.rulingia.com> Date: Mon, 15 Oct 2012 12:26:33 +0200 Message-Id: <038654D6-9944-4AF8-B299-AE3BF6C28343@patpro.net> References: <20121015095858.GC33428@server.rulingia.com> To: Peter Jeremy X-Mailer: Apple Mail (2.1085) X-Content-Filtered-By: Mailman/MimeDel 2.1.14 Cc: freebsd-hardware@freebsd.org, nate keegan X-BeenThere: freebsd-hardware@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: General discussion of FreeBSD
hardware List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 15 Oct 2012 10:35:13 -0000 --Apple-Mail-102-474922847 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=windows-1252

On 15 oct. 2012, at 11:58, Peter Jeremy wrote:

> This _does_ sound more like hardware than software

I do agree with that.

> Have you tried replacing RAM & PSU?

I, too, was about to suggest a test or replacement of the PSU.
Also, I've had a (quite) similar problem years ago (no raid, no zfs, older freebsd...) where HDD would detach or be lost by the system on a random basis. I searched for a long time on the software side, but it was cured by a firmware update on the HDDs.

Good luck with this issue.

Patrick

--Apple-Mail-102-474922847--

From owner-freebsd-hardware@FreeBSD.ORG Mon Oct 15 11:06:12 2012 Return-Path: Delivered-To: freebsd-hardware@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 6A9EA509 for ; Mon, 15 Oct 2012 11:06:12 +0000 (UTC) (envelope-from owner-bugmaster@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [8.8.178.135]) by mx1.freebsd.org (Postfix) with ESMTP id 375DF8FC31 for ; Mon, 15 Oct 2012 11:06:12 +0000 (UTC) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.5/8.14.5) with ESMTP id q9FB6Ckx011486 for ; Mon, 15 Oct 2012 11:06:12 GMT (envelope-from owner-bugmaster@FreeBSD.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.5/8.14.5/Submit) id q9FB6CaR011485 for freebsd-hardware@FreeBSD.org; Mon, 15 Oct 2012 11:06:12 GMT (envelope-from owner-bugmaster@FreeBSD.org) Date: Mon, 15 Oct 2012 11:06:12 GMT Message-Id: <201210151106.q9FB6CaR011485@freefall.freebsd.org> X-Authentication-Warning: freefall.freebsd.org: gnats set sender to owner-bugmaster@FreeBSD.org using -f From: FreeBSD bugmaster To: freebsd-hardware@FreeBSD.org Subject: Current problem reports
assigned to freebsd-hardware@FreeBSD.org X-BeenThere: freebsd-hardware@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: General discussion of FreeBSD hardware List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 15 Oct 2012 11:06:12 -0000 Note: to view an individual PR, use: http://www.freebsd.org/cgi/query-pr.cgi?pr=(number). From owner-freebsd-hardware@FreeBSD.ORG Mon Oct 15 14:54:29 2012 Return-Path: Delivered-To: freebsd-hardware@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 5C92B10B for ; Mon, 15 Oct 2012 14:54:29 +0000 (UTC) (envelope-from nate.keegan@gmail.com) Received: from mail-vc0-f182.google.com (mail-vc0-f182.google.com [209.85.220.182]) by mx1.freebsd.org (Postfix) with ESMTP id 09DA78FC0A for ; Mon, 15 Oct 2012 14:54:28 +0000 (UTC) Received: by mail-vc0-f182.google.com with SMTP id fw7so7588836vcb.13 for ; Mon, 15 Oct 2012 07:54:22 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=JQ6NCrlcodzj5rO9vv2AoDaNtBQ0pBxs8F/A7h1gTwg=; b=x7/ctLI3MAYnumtEjoMHP5sMXfxraQ0LgNICa9/hwzTSTO3C2+WnS0WiQy8QogBwML 9/3gJpJQZNkGoaqn2GYipXUFSm0JaMNBc75IBFdn1+tHVBm7my88/M4pVPcFp5r6syij uUkoCOkbJzm9s0AWKMxlrdOC88S/b+QXHnnhB+eR7vRjt1B99LKo72HKsikBdaVxiSLl qqKNqGus6+nI2CY/o41ztFH+dUqHBdo1FcdE790KIWgVuflwg3gH+J4Wi+TNQ5A2pudy clhBshzmcaOEqT71L1mfxmZoWlWZdKQsd4if/MbIzys4r9iCRXd9iasztR3d/UIwNMLQ o21A== MIME-Version: 1.0 Received: by 10.52.66.36 with SMTP id c4mr5565912vdt.6.1350312862021; Mon, 15 Oct 2012 07:54:22 -0700 (PDT) Received: by 10.58.240.42 with HTTP; Mon, 15 Oct 2012 07:54:21 -0700 (PDT) In-Reply-To: <20121015095858.GC33428@server.rulingia.com> References: <20121015095858.GC33428@server.rulingia.com> Date: Mon, 15 Oct 2012 07:54:21 -0700 Message-ID: Subject: Re: ahcich Timeouts SATA SSD From: nate keegan To: 
freebsd-hardware@freebsd.org Content-Type: text/plain; charset=ISO-8859-1 X-BeenThere: freebsd-hardware@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: General discussion of FreeBSD hardware List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 15 Oct 2012 14:54:29 -0000 The system is dual PSU behind a UPS so I don't think that this is an issue. My notes show that we replaced one of the DIMMs on this system a few months ago as it was detected as bad during a POST. During the cycle of reboots that I have taken on with testing resolutions to this issue I have seen a single time where the BIOS detected a bad DIMM but only one time. I do have a complete set of replacement memory (Crucial vs Kingston that is in the system now) and will swap out the memory in case one of the DIMMs is flaky but not poor enough for the BIOS to notice on a consistent basis. I am not able to drop into DDB when the issue happens as the system is locked up completely. Could be a failure on my part to understand/engage in how to do this, will try if the issue happens again (should on Wednesday AM unless setting camcontrol apm to off for the disks somehow fixes the issue). I am running GENERIC kernel and have not set any loader tunables or sysctls other than that related to addressing this issue (SATA power management, AHCI, etc). The problem first started around the time when we setup pool scrubbing and at that time it was a single instance which seemed to be tied to the bad DIMM. Have not run pool scrubbing since that time. Will get the output of gstat -a and post it up here. Will upgrade to FreeBSD 9.1RC2 today and compile kernel with the options you suggested. I already went ahead and removed the L2ARC and one of the OS SSD drives to simplify things - now I have 1 x SSD with OS and 1 x SSD for swap and that is it. I ran the Crucial firmware update ISO and it did not see any firmware updates as necessary on the SSD disks. 
I appreciate the feedback as part of the difficulty here has been making a determination of whether this is software/driver or hardware.

If software I agree that it would not make sense that this would suddenly pop up after months of operation with no issues.

> Are you running a GENERIC kernel? If not, what changes have you made?
> Have you set any loader tunables or sysctls?
> Have you scrubbed the pools?
> If you run "gstat -a", do any devices have anomalous readings?
>
> I can't offer any definite fixes but can suggest a few more things to
> try:
> 1) Try FreeBSD-9.1RC2 and see if the problem persists.
> 2) Try a new kernel with
>         options WITNESS
>         options WITNESS_SKIPSPIN
>    this may make a software bug more obvious (but will somewhat increase
>    kernel overheads)
> 3) If you can afford it, detach the L2ARC - which removes one potential issue.
> 4) If you haven't already, build a kernel with
>         makeoptions DEBUG=-g
>         options KDB
>         options KDB_TRACE
>         options KDB_UNATTENDED
>         options DDB
>    this won't have any impact on normal operation but will simplify debugging.
From owner-freebsd-hardware@FreeBSD.ORG Mon Oct 15 17:21:08 2012 Return-Path: Delivered-To: freebsd-hardware@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id F2ACAAED for ; Mon, 15 Oct 2012 17:21:07 +0000 (UTC) (envelope-from nate.keegan@gmail.com) Received: from mail-vb0-f54.google.com (mail-vb0-f54.google.com [209.85.212.54]) by mx1.freebsd.org (Postfix) with ESMTP id A3B488FC16 for ; Mon, 15 Oct 2012 17:21:07 +0000 (UTC) Received: by mail-vb0-f54.google.com with SMTP id v11so7038559vbm.13 for ; Mon, 15 Oct 2012 10:21:07 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=rBo39Ok+QFER0hKIGHmgXznw8o2RNew94JAm9CmkdhA=; b=hLfQbzY+8a2IUshJVyRfgaeyc41yy0k04fKl7EZy60VvBkVaSuP7C4AJXUHqAT6r7H XNLcxWvuWTZ8m/yMiCGiowuhtNSFLqmjkLmBXEIwM4DJCU3EgjZUFUSHiQYwsO9VOS7k IkTZfGJIodiNz8y1TRAH9mVzaKx0tq392z7bwNIbGQ8bNFFCK7ikW4+s/6UoaHoQpkTv UxRRwVzTssutsYP0Vslud+93sn3TvQaQlsSzaknv8gHlOLJSD+znHdMPhKr1jgGqCGvP c50TDa9FcIHXl+o4sre7MZH1zTh2GG4P/ARiNG1VMLKbPOiMKnb7DtEYV5dMTKaggBcs vp7Q== MIME-Version: 1.0 Received: by 10.221.2.76 with SMTP id nt12mr7088337vcb.12.1350321666934; Mon, 15 Oct 2012 10:21:06 -0700 (PDT) Received: by 10.58.240.42 with HTTP; Mon, 15 Oct 2012 10:21:06 -0700 (PDT) In-Reply-To: References: <20121015095858.GC33428@server.rulingia.com> Date: Mon, 15 Oct 2012 10:21:06 -0700 Message-ID: Subject: Re: ahcich Timeouts SATA SSD From: nate keegan To: freebsd-hardware@freebsd.org Content-Type: text/plain; charset=ISO-8859-1 X-BeenThere: freebsd-hardware@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: General discussion of FreeBSD hardware List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 15 Oct 2012 17:21:08 -0000 I took a look at the DDB man page and I am not able to do this when the issue happens as the system is 
completely blown up (meaning no keyboard input on IPMI console, existing SSH sessions, etc.).

No changes have been seen in the ZFS load on the system. The nature of this system (backup) is such that the heaviest load would be created in the first week or so of going online as we use rsync to copy files down from our Windows servers and during this first week or so the system has to 'seed' the initial copies which would be much heavier on I/O than after that first week where things are relatively constant in terms of I/O.

I have 48 Gb of Crucial memory that I will put in this system today to replace the 24 Gb or so of Kingston memory I have in the system.

If the issue happens again with the memory change I plan on replacing both SSD (Crucial M4) with two non-SSD SATA disks with the idea that maybe the Crucial firmware on the disks (002 on both disks) is the culprit somehow.

If neither item turns out to solve the issue I will move on to 9.1RC2, or 9.1-RELEASE if it is out by then, and add the kernel options requested.

The amount of monkeying that I have had to do via /boot/loader.conf and the camcontrol script I run is telling me that the SSD, the firmware on the SSD, etc is somehow causing the issue as we have plenty of other FreeBSD 8.x and 9.x systems that use non-SSD SATA drives without this issue popping up in their daily workloads.
My /boot/loader.conf looks like this currently:

# Set in the BIOS as well to activate
ahci_load="YES"

# Should be auto-negotiation in FreeBSD 9.x
# See ahci(4)
hint.ahcich.0.sata_rev=1
hint.ahcich.1.sata_rev=1
hint.ahcich.0.pm_level=1
hint.ahcich.1.pm_level=1

And /usr/local/etc/rc.d/camcontrol:

#!/bin/sh

CAMCONTROL=/sbin/camcontrol

# Disable NCQ
$CAMCONTROL tags ada0 -N 1 > /dev/null
$CAMCONTROL tags ada1 -N 1 > /dev/null

# Disable APM
$CAMCONTROL cmd ada0 -a "EF 85 00 00 00 00 00 00 00 00 00 00" > /dev/null
$CAMCONTROL cmd ada1 -a "EF 85 00 00 00 00 00 00 00 00 00 00" > /dev/null

Without both of these shims in place I get maybe 1.5 hours to two hours or so before the system goes kablooie and that is without the system doing any real I/O work just running FreeBSD during the business day and a few scripts from cron to check for data and shuffle it around.

From owner-freebsd-hardware@FreeBSD.ORG Mon Oct 15 21:55:01 2012 Return-Path: Delivered-To: freebsd-hardware@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 3D6ED6CB for ; Mon, 15 Oct 2012 21:55:01 +0000 (UTC) (envelope-from peter@rulingia.com) Received: from vps.rulingia.com (host-122-100-2-194.octopus.com.au [122.100.2.194]) by mx1.freebsd.org (Postfix) with ESMTP id DB7378FC0C for ; Mon, 15 Oct 2012 21:55:00 +0000 (UTC) Received: from server.rulingia.com (c220-239-248-178.belrs5.nsw.optusnet.com.au [220.239.248.178]) by vps.rulingia.com (8.14.5/8.14.5) with ESMTP id q9FLswf0023403 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Tue, 16 Oct 2012 08:54:58 +1100 (EST) (envelope-from peter@rulingia.com) X-Bogosity: Ham, spamicity=0.000000 Received: from server.rulingia.com (localhost.rulingia.com [127.0.0.1]) by server.rulingia.com (8.14.5/8.14.5) with ESMTP id q9FLspYW023120 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Tue, 16 Oct 2012 08:54:52 +1100 (EST) (envelope-from peter@server.rulingia.com) 
Received: (from peter@localhost) by server.rulingia.com (8.14.5/8.14.5/Submit) id q9FLspeO023119; Tue, 16 Oct 2012 08:54:51 +1100 (EST) (envelope-from peter) Date: Tue, 16 Oct 2012 08:54:51 +1100 From: Peter Jeremy To: nate keegan Subject: Re: ahcich Timeouts SATA SSD Message-ID: <20121015215451.GE33428@server.rulingia.com> References: <20121015095858.GC33428@server.rulingia.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="lMM8JwqTlfDpEaS6" Content-Disposition: inline In-Reply-To: X-PGP-Key: http://www.rulingia.com/keys/peter.pgp User-Agent: Mutt/1.5.21 (2010-09-15) Cc: freebsd-hardware@freebsd.org X-BeenThere: freebsd-hardware@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: General discussion of FreeBSD hardware List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 15 Oct 2012 21:55:01 -0000 --lMM8JwqTlfDpEaS6 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On 2012-Oct-15 07:54:21 -0700, nate keegan wrote: >The system is dual PSU behind a UPS so I don't think that this is an issue. OK >I do have a complete set of replacement memory (Crucial vs Kingston >that is in the system now) and will swap out the memory in case one of >the DIMMs is flaky but not poor enough for the BIOS to notice on a >consistent basis. I presume this is registered ECC RAM - which makes it more robust. Non-ECC RAM can develop pattern-sensitive faults - which are virtually impossible to test for. And BIOS RAM 'tests' generally can't be relied on to do much more than verify that something is responding. Swapping RAM is the best way to rule out RAM issues. >I am not able to drop into DDB when the issue happens as the system is >locked up completely. That's surprising. I haven't seen a failure mode where the kernel will respond to pings but not the console. 
>Will get the output of gstat -a and post it up here.

"gstat -a" gives a dynamic picture of disk activity. I was hoping you could watch it for a minute or so (on a tall window) whilst the system was running and see if any disks look odd - significantly higher or lower than expected I/O volume or long ms/r or ms/w.

On 2012-Oct-15 10:21:06 -0700, nate keegan wrote:
>I took a look at the DDB man page and I am not able to do this when
>the issue happens as the system is completely blown up (meaning no
>keyboard input on IPMI console, existing SSH sessions, etc.

Note that I'm referring to ddb(4), not ddb(8). The former is entered via a "magic" key sequence on the console and should work even if the system won't react to normal commands. To enter ddb, use Ctrl-Alt-ESC on a graphical console or the character sequence CR ~ Ctrl-B on a serial console (in the latter case, the sysctl debug.kdb.alt_break_to_debugger also needs to be set to 1).

If you do get into ddb, a useful set of initial commands is:
  show all procs
  show alllocks
  show allpcpu
  show lockedvnods
  call doadump

Note that the first 4 commands will generate lots of output - ideally you would have a serial console with logging. The last command generates a crashdump and needs 'dumpdev="AUTO"' in /etc/rc.conf (run "service dumpon start" after editing rc.conf to enable it without rebooting).

>The amount of monkeying that I have had to do via /boot/loader.conf
>and the camcontrol script I run is telling me that the SSD, the
>firmware on the SSD, etc is somehow causing the issue as we have
>plenty of other FreeBSD 8.x and 9.x systems that use non-SSD SATA
>drives without this issue popping up in their daily workloads.

Are you able to move the SSD(s) to a different type of SATA port? One (not especially likely) possibility is it's an interaction between the SSD and the SATA controller.
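The crashdump preparation mentioned above could be scripted roughly as follows. This is only a sketch under assumptions: ensure_setting is a hypothetical helper written for this illustration (it is not a FreeBSD utility), and it does nothing more than make sure a single KEY="VALUE" line exists in an rc.conf-style file. The service and sysctl commands from the text are left as comments, since they only make sense when run as root on the affected host.

```shell
#!/bin/sh
# Hypothetical helper (not part of FreeBSD): make sure KEY="VALUE" appears
# exactly once in an rc.conf-style file, replacing any earlier assignment.
ensure_setting() {
    file=$1
    key=$2
    value=$3
    touch "$file" || return 1
    tmp=$(mktemp) || return 1
    # Keep every line that is not an assignment of this key ...
    grep -v "^${key}=" "$file" > "$tmp"
    # ... then append the wanted assignment.
    printf '%s="%s"\n' "$key" "$value" >> "$tmp"
    # Write back in place so the original file keeps its owner and mode.
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# On the affected host, as root, following the steps described above:
#   ensure_setting /etc/rc.conf dumpdev AUTO
#   service dumpon start
#   sysctl debug.kdb.alt_break_to_debugger=1
```

Running the helper twice leaves a single dumpdev line, so it should be safe to rerun after other edits to rc.conf. Note that the sysctl setting set this way does not survive a reboot.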
--=20 Peter Jeremy --lMM8JwqTlfDpEaS6 Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (FreeBSD) iEYEARECAAYFAlB8hisACgkQ/opHv/APuIeFVQCfbV8Oj+V1KFHTq0mutiGBBWLl kYcAnR7gP4OFXOzvUl8Y/ZIajZN1Wy9N =qzoM -----END PGP SIGNATURE----- --lMM8JwqTlfDpEaS6-- From owner-freebsd-hardware@FreeBSD.ORG Mon Oct 15 22:16:57 2012 Return-Path: Delivered-To: freebsd-hardware@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 346D9C2E for ; Mon, 15 Oct 2012 22:16:57 +0000 (UTC) (envelope-from dieterbsd@engineer.com) Received: from mailout-us.gmx.com (mailout-us.gmx.com [74.208.5.67]) by mx1.freebsd.org (Postfix) with SMTP id E506D8FC0C for ; Mon, 15 Oct 2012 22:16:56 +0000 (UTC) Received: (qmail 28574 invoked by uid 0); 15 Oct 2012 20:32:31 -0000 Received: from 67.206.187.68 by rms-us011.v300.gmx.net with HTTP Content-Type: text/plain; charset="utf-8" Date: Mon, 15 Oct 2012 16:32:28 -0400 From: "Dieter BSD" Message-ID: <20121015203229.40280@gmx.com> MIME-Version: 1.0 Subject: Re: ahcich Timeouts SATA SSD To: freebsd-hardware@freebsd.org X-Authenticated: #74169980 X-Flags: 0001 X-Mailer: GMX.com Web Mailer x-registered: 0 Content-Transfer-Encoding: 8bit X-GMX-UID: ihcOcNhu3zOlNR3dAHAhJM9+IGRvbwBW X-BeenThere: freebsd-hardware@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: General discussion of FreeBSD hardware List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 15 Oct 2012 22:16:57 -0000 > SSD are connected to on-board SATA port on motherboard Presumably to controllers provided by the Intel Tylersburg 5520 chipset. > This system was commissioned in February of 2012 and ran without issue > as a ZFS backup system on our network until about 3 weeks ago. > The system is dual PSU behind a UPS so I don't think that this is an issue. No changes? e.g. no added hardware to increase power load. 
Overloading the power supply and/or the wiring (with too many splitters) can result in flaky problems like this.

> OS will respond to ping requests after the issue and if you have an
> active SSH session you will remain connected to the system until you
> attempt to do something like 'ls', 'ps', etc.

> I am not able to drop into DDB when the issue happens as the system is
> locked up completely. Could be a failure on my part to
> understand/engage in how to do this, will try if the issue happens
> again (should on Wednesday AM unless setting camcontrol apm to off for
> the disks somehow fixes the issue).

If the system is alive enough to respond to ping, I'd expect you should be able to get into DDB? Can you get into DDB when the system is working normally?

> 2 x Crucial M4 64 Gb SATA SSD for FreeBSD OS (zroot)
> 2 x Intel 320 MLC 80 Gb SATA SSD for L2ARC and swap

> I ran the Crucial firmware update ISO and it did not see any firmware
> updates as necessary on the SSD disks.

Does the problem happen with both the Crucial and the Intel SSDs?

> If software I agree that it would not make sense that this would
> suddenly pop-up after months of operation with no issues.

If something causes the software/firmware to take a different path, new issues can appear. E.g. error handling or even timing. Infrequently used code paths might not have been tested sufficiently.

Does the controller have firmware? Part of the BIOS I suppose. Is there a BIOS update available? Have you considered connecting the SSDs to a different controller?

> the on-board AHCI portion of the BIOS does
> not always see the disks after the event without a hard system power
> reset.

That's at least one bug somewhere, probably the hardware isn't getting reset properly. Does Supermicro know about this bug?

> I have 48 Gb of Crucial memory that I will put in this system today to
> replace the 24 Gb or so of Kingston memory I have in the system.

Which in addition to being different memory, should reduce swap activity. Suggestion: move everything to conventional drives. Keep at least one SSD connected to system, but normally unused. Now you can beat on the SSD in a controlled manner to debug the problem. Does reading trigger the problem? Writing? Try dd with different blocksizes, accessing multiple SSDs at once, etc. I have to wonder if there is a timing problem, or missing interrupt, or... > * Ditch FreeBSD for Solaris so I can keep ZFS lovin for the intended > purpose of this system If it fails with FreeBSD but works with Solaris on the same hardware, then it is almost certainly a problem with the device driver. (Or at least a problem that Solaris has a workaround for.) From owner-freebsd-hardware@FreeBSD.ORG Tue Oct 16 19:48:17 2012 Return-Path: Delivered-To: freebsd-hardware@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id E11169FD for ; Tue, 16 Oct 2012 19:48:17 +0000 (UTC) (envelope-from nate.keegan@gmail.com) Received: from mail-qc0-f182.google.com (mail-qc0-f182.google.com [209.85.216.182]) by mx1.freebsd.org (Postfix) with ESMTP id 8D86D8FC17 for ; Tue, 16 Oct 2012 19:48:17 +0000 (UTC) Received: by mail-qc0-f182.google.com with SMTP id l39so6671615qcs.13 for ; Tue, 16 Oct 2012 12:48:17 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=cIyBkMl4icwm3zArkZDSiBvHd0msEphQMFUafiIR9DY=; b=M1vytWVore73Ji0BAwH/A0mBsUcIsPOG88biTcmURcu/dwb7hFu7TdurnocOBayfz9 ZAJZJCJIFXq6n5ogjruvQO6kU6PTm3N02Ftg/Bmqoj9hsXorAsJpwxjJhUP+MyJl2/X+ MBqEKXvkm5HTE/biTPWkNXAx286uSvvQ3YCoZsezIxaYM4OETxkK4ABs/n0Pr5JDXljP 4dJoKMs93afo3jvUCho5h4b/arVnKz79QN3BuYFxiZiy1eFdekDY2mA8xYr1SFUweumN Fc3skFBpykt7BOT7xqoHrp/XgiA2sZQ4IenoIjXqq9g6ivcbY2qPfLUeP8Ck4LcyWa6f rKVQ== MIME-Version: 1.0 Received: by 10.58.252.67 with SMTP id 
zq3mr8355vec.43.1350416896805; Tue, 16 Oct 2012 12:48:16 -0700 (PDT) Received: by 10.58.240.42 with HTTP; Tue, 16 Oct 2012 12:48:16 -0700 (PDT) In-Reply-To: <20121015203229.40280@gmx.com> References: <20121015203229.40280@gmx.com> Date: Tue, 16 Oct 2012 12:48:16 -0700 Message-ID: Subject: Re: ahcich Timeouts SATA SSD From: nate keegan To: freebsd-hardware@freebsd.org Content-Type: text/plain; charset=ISO-8859-1 X-BeenThere: freebsd-hardware@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: General discussion of FreeBSD hardware List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 16 Oct 2012 19:48:18 -0000 I'm only seeing gstat output of a few percentage points for the OS disks. I am using ECC memory (both the Kingston and the new Crucial memory) and went ahead and swapped out the SSD for SATA disks this morning. Since both SSD were the same firmware and type/manufacturer I figured it was a good time to address this variable. I also went ahead and put in a serial console server this morning so I have proper console access instead of relying on the Supermicro iLO utility. Will keep an eye on the pure SATA setup to see if it barfs or not. Will try to gather some ddb(4) information if it does barf again. On Mon, Oct 15, 2012 at 1:32 PM, Dieter BSD wrote: >> SSD are connected to on-board SATA port on motherboard > > Presumably to controllers provided by the Intel Tylersburg 5520 chipset. > >> This system was commissioned in February of 2012 and ran without issue >> as a ZFS backup system on our network until about 3 weeks ago. > >> The system is dual PSU behind a UPS so I don't think that this is an issue. > > No changes? e.g. no added hardware to increase power load. > Overloading the power supply and/or the wiring (with too many splitters) > can result in flaky problems like this. 
> >> OS will respond to ping requests after the issue and if you have an >> active SSH session you will remain connected to the system until you >> attempt to do something like 'ls', 'ps', etc. > >> I am not able to drop into DDB when the issue happens as the system is >> locked up completely. Could be a failure on my part to >> understand/engage in how to do this, will try if the issue happens >> again (should on Wednesday AM unless setting camcontrol apm to off for >> the disks somehow fixes the issue). > > If the system is alive enough to respond to ping, I'd expect you > should be able to get into DDB? Can you get into DDB when the system > is working normally? > >> 2 x Crucial M4 64 Gb SATA SSD for FreeBSD OS (zroot) >> 2 x Intel 320 MLC 80 Gb SATA SSD for L2ARC and swap > >> I ran the Crucial firmware update ISO and it did not see any firmware >> updates as necessary on the SSD disks. > > Does the problem happen with both the Crucial and the Intel SSDs? > >> If software I agree that it would not make sense that this would >> suddenly pop-up after months of operation with no issues. > > If something causes the software/firmware to take a different > path, new issues can appear. E.g. error handling or even timing. > Infrequently used code paths might not have been tested sufficiently. > > Does the controller have firmware? Part of the BIOS I suppose. > Is there a BIOS update available? Have you considered connecting the > SSDs to a different controller? > >> the on-board AHCI portion of the BIOS does >> not always see the disks after the event without a hard system power >> reset. > > That's at least one bug somewhere, probably the hardware isn't getting reset > properly. Does Supermicro know about this bug? > >> I have 48 Gb of Crucial memory that I will put in this system today to >> replace the 24 Gb or so of Kingston memory I have in the system. > > Which in addition to being different memory, should reduce swap activity. 
> > Suggestion: move everything to conventional drives. Keep at least one > SSD connected to system, but normally unused. Now you can beat on the > SSD in a controlled manner to debug the problem. Does reading trigger > the problem? Writing? Try dd with different blocksizes, accessing > multiple SSDs at once, etc. I have to wonder if there is a timing problem, > or missing interrupt, or... > >> * Ditch FreeBSD for Solaris so I can keep ZFS lovin for the intended >> purpose of this system > > If it fails with FreeBSD but works with Solaris on the same hardware, > then it is almost certainly a problem with the device driver. (Or > at least a problem that Solaris has a workaround for.) > _______________________________________________ > freebsd-hardware@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-hardware > To unsubscribe, send any mail to "freebsd-hardware-unsubscribe@freebsd.org" From owner-freebsd-hardware@FreeBSD.ORG Fri Oct 19 21:56:00 2012 Return-Path: Delivered-To: freebsd-hardware@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 8988C329; Fri, 19 Oct 2012 21:56:00 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id 8E0EF8FC17; Fri, 19 Oct 2012 21:55:56 +0000 (UTC) Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua [212.40.38.100]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id AAA02815; Sat, 20 Oct 2012 00:55:49 +0300 (EEST) (envelope-from avg@FreeBSD.org) Received: from localhost ([127.0.0.1]) by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD)) id 1TPKXk-0009A3-Ir; Sat, 20 Oct 2012 00:55:48 +0300 Message-ID: <5081CC62.8080701@FreeBSD.org> Date: Sat, 20 Oct 2012 00:55:46 +0300 From: Andriy Gapon User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:16.0) Gecko/20121013 Thunderbird/16.0.1 MIME-Version: 1.0 To: 
freebsd-scsi@FreeBSD.org, freebsd-hardware@FreeBSD.org Subject: kern/172833: tws driver update from LSI X-Enigmail-Version: 1.4.5 Content-Type: text/plain; charset=X-VIET-VPS Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-hardware@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: General discussion of FreeBSD hardware List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 19 Oct 2012 21:56:00 -0000 http://www.freebsd.org/cgi/query-pr.cgi?pr=172833 Maybe someone here would be interested or could comment on the PR. -- Andriy Gapon