From: Peter Jeremy
To: nate keegan
Cc: freebsd-hardware@freebsd.org
Date: Tue, 16 Oct 2012 08:54:51 +1100
Subject: Re: ahcich Timeouts SATA SSD
Message-ID: <20121015215451.GE33428@server.rulingia.com>
References: <20121015095858.GC33428@server.rulingia.com>
List-Id: General discussion of FreeBSD hardware

On 2012-Oct-15 07:54:21 -0700, nate keegan wrote:
>The system is dual PSU behind a UPS so I don't think that this is an issue.

OK

>I do have a complete set of replacement memory (Crucial vs Kingston
>that is in the system now) and will swap out the memory in case one of
>the DIMMs is flaky but not poor enough for the BIOS to notice on a
>consistent basis.

I presume this is registered ECC RAM - which makes it more robust.
Non-ECC RAM can develop pattern-sensitive faults - which are virtually
impossible to test for.  And BIOS RAM 'tests' generally can't be
relied on to do much more than verify that something is responding.
Swapping RAM is the best way to rule out RAM issues.

>I am not able to drop into DDB when the issue happens as the system is
>locked up completely.

That's surprising.  I haven't seen a failure mode where the kernel
will respond to pings but not the console.

>Will get the output of gstat -a and post it up here.

"gstat -a" gives a dynamic picture of disk activity.  I was hoping you
could watch it for a minute or so (on a tall window) whilst the system
was running and see if any disks look odd - significantly higher or
lower than expected I/O volume or long ms/r or ms/w.

On 2012-Oct-15 10:21:06 -0700, nate keegan wrote:
>I took a look at the DDB man page and I am not able to do this when
>the issue happens as the system is completely blown up (meaning no
>keyboard input on IPMI console, existing SSH sessions, etc.

Note that I'm referring to ddb(4), not ddb(8).  The former is entered
via a "magic" key sequence on the console and should work even if the
system won't react to normal commands.

To enter ddb, use Ctrl-Alt-ESC on a graphical console or the character
sequence CR ~ Ctrl-B on a serial console (in the latter case, the
sysctl debug.kdb.alt_break_to_debugger also needs to be set to 1).

If you do get into ddb, a useful set of initial commands is:
  show all procs
  show alllocks
  show allpcpu
  show lockedvnods
  call doadump

Note that the first 4 commands will generate lots of output - ideally
you would have a serial console with logging.  The last command
generates a crashdump and needs 'dumpdev="AUTO"' in /etc/rc.conf (run
"service dumpon start" after editing rc.conf to enable it without
rebooting).

>The amount of monkeying that I have had to do via /boot/loader.conf
>and the camcontrol script I run is telling me that the SSD, the
>firmware on the SSD, etc is somehow causing the issue as we have
>plenty of other FreeBSD 8.x and 9.x systems that use non-SSD SATA
>drives without this issue popping up in their daily workloads.

Are you able to move the SSD(s) to a different type of SATA port?  One
(not especially likely) possibility is it's an interaction between the
SSD and the SATA controller.

-- 
Peter Jeremy
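
Since "call doadump" only helps if a dump device was configured before
the hang, it may be worth checking the setup ahead of time.  The
following is only a rough sketch: check_dumpdev is a hypothetical helper
name (not a FreeBSD tool), and it assumes dumpdev is set on a single
uncommented line of the rc.conf-style file it is given.  The two
commented commands at the end are the ones named in this message.

```shell
#!/bin/sh
# Sketch only: pre-flight check for the crashdump setup described above.
# check_dumpdev is a hypothetical helper, not part of FreeBSD.

# Succeed if FILE (default /etc/rc.conf) has an uncommented dumpdev=
# assignment whose value is neither empty nor NO.
check_dumpdev() {
    conf="${1:-/etc/rc.conf}"
    # Extract the value of each uncommented dumpdev= line; keep the last
    # one, on the assumption that a later assignment overrides an earlier one.
    value=$(sed -n 's/^[[:space:]]*dumpdev="\{0,1\}\([^"#[:space:]]*\).*/\1/p' \
        "$conf" 2>/dev/null | tail -n 1)
    case "$value" in
        ""|[Nn][Oo]) return 1 ;;
        *) return 0 ;;
    esac
}

# On the FreeBSD box itself (as root), the related steps from the message:
#   sysctl debug.kdb.alt_break_to_debugger=1   # serial console break sequence
#   service dumpon start                       # activate dumpdev without reboot
```

A possible use would be running
check_dumpdev /etc/rc.conf || echo "dumpdev not set"
before relying on "call doadump" from ddb.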