Date: Mon, 15 Oct 2012 20:58:58 +1100
From: Peter Jeremy
To: nate keegan
Cc: freebsd-hardware@freebsd.org
Subject: Re: ahcich Timeouts SATA SSD
Message-ID: <20121015095858.GC33428@server.rulingia.com>

On 2012-Oct-14 16:03:39 -0700, nate keegan wrote:
>Based on what I'm seeing for post types on freebsd-questions this
>might be the best forum for this issue as it looks like some sort of a
>strange issue or bug between FreeBSD 8.2/9.0 and SATA SSD drives.
>
>This system was commissioned in February of 2012 and ran without issue
>as a ZFS backup system on our network until about 3 weeks ago.
>
>At that time I started getting kernel panics due to timeouts to the
>on-board SATA devices. The only change to the system since it was
>built was to add an SSD for swap (32 Gb swap device) and this issue
>did not happen until several months after this was added.

This _does_ sound more like hardware than software - it's difficult to
envisage a software bug that does nothing for 6 months and then makes
the system hang regularly. Has there been any significant change to
the system load, how much data is being transferred, clients, how full
the data zpool is, etc that might correlate with the onset of hangs?

>I then moved to systematically replacing items such as SATA cables,
>memory, motherboard, etc and the problem continued. For example, I
>swapped out the 4 SATA cables with brand new SATA cables and waited to
>see if the problem happened again. Once it did I moved on to replacing
>the motherboard with an identical motherboard, waited, etc.

Have you tried replacing RAM & PSU?

>The system logs do not show anything prior to event happening and the
>OS will respond to ping requests after the issue and if you have an
>active SSH session you will remain connected to the system until you
>attempt to do something like 'ls', 'ps', etc.

This implies that the kernel is still active but the filesystem is
deadlocked. Are you able to drop into DDB? Is anything displayed
on the console?

>New SSH requests to the system get 'connection refused'.

This implies that sshd has died - a filesystem deadlock should result
in connection attempts either timing out or just hanging.

>I'm open to suggestions, direction, etc to see if I can nail down what
>is going on and put this issue to bed for not only myself but for
>anyone else who might run into it in the future.

Are you running a GENERIC kernel? If not, what changes have you made?
Have you set any loader tunables or sysctls?
Have you scrubbed the pools?
If you run "gstat -a", do any devices have anomalous readings?

I can't offer any definite fixes but can suggest a few more things to try:
1) Try FreeBSD 9.1-RC2 and see if the problem persists.
2) Try a new kernel with
	options WITNESS
	options WITNESS_SKIPSPIN
   this may make a software bug more obvious (but will somewhat increase
   kernel overheads)
3) If you can afford it, detach the L2ARC - which removes one potential
   issue.
4) If you haven't already, build a kernel with
	makeoptions DEBUG=-g
	options KDB
	options KDB_TRACE
	options KDB_UNATTENDED
	options DDB
   this won't have any impact on normal operation but will simplify
   debugging.

--
Peter Jeremy
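
The gstat check suggested above can be partly automated once the output
has been saved to a file. The following is a minimal Python sketch, not a
tested diagnostic tool. It assumes the column order L(q), ops/s, r/s,
kBps, ms/r, w/s, kBps, ms/w, %busy, Name, so check that against the
header line gstat prints on your system. The 100 ms and 90% thresholds,
the function names and the sample data are all made up for illustration.

```python
# Sketch: flag devices in saved gstat output whose latency or busy
# percentage stands out. The column order below is an assumption.

COLUMNS = ["L(q)", "ops/s", "r/s", "kBps_r", "ms/r",
           "w/s", "kBps_w", "ms/w", "%busy", "name"]


def parse_gstat(text):
    """Return one dict per device line; header and status lines are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS):
            continue  # e.g. the "dT: ..." status line
        try:
            values = [float(f) for f in fields[:-1]]
        except ValueError:
            continue  # the column header line, or anything non-numeric
        row = dict(zip(COLUMNS[:-1], values))
        row["name"] = fields[-1]
        rows.append(row)
    return rows


def flag_anomalies(rows, max_ms=100.0, max_busy=90.0):
    """Return names of devices exceeding the (arbitrary) thresholds."""
    flagged = []
    for row in rows:
        slow = row["ms/r"] > max_ms or row["ms/w"] > max_ms
        if slow or row["%busy"] > max_busy:
            flagged.append(row["name"])
    return flagged


# Hypothetical sample, not taken from the system discussed in this thread.
SAMPLE = """\
dT: 1.002s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0     42     40   5120    1.2      2     64    0.8    5.1 ada0
    0     41     39   4992    1.3      2     64    0.9    5.3 ada1
   32      3      1    128  950.4      2     64  812.7  100.0 ada2
"""

if __name__ == "__main__":
    print(flag_anomalies(parse_gstat(SAMPLE)))
```

A device that shows a deep queue, very high ms/r or ms/w and 100% busy
while its peers sit idle is the kind of reading worth reporting back to
the list, along with the raw gstat output itself.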