From: nate keegan
To: freebsd-hardware@freebsd.org
Date: Mon, 15 Oct 2012 07:54:21 -0700
Subject: Re: ahcich Timeouts SATA SSD
In-Reply-To: <20121015095858.GC33428@server.rulingia.com>

The system is dual PSU behind a UPS so I don't think that
this is an issue.

My notes show that we replaced one of the DIMMs on this system a few
months ago as it was detected as bad during POST. During the cycle of
reboots I have gone through while testing fixes for this issue, the
BIOS detected a bad DIMM only once.

I do have a complete set of replacement memory (Crucial, versus the
Kingston that is in the system now) and will swap out the memory in
case one of the DIMMs is flaky but not bad enough for the BIOS to
notice consistently.

I am not able to drop into DDB when the issue happens as the system is
locked up completely. This could be a failure on my part to understand
how to do this; will try if the issue happens again (it should on
Wednesday AM unless setting camcontrol apm to off for the disks somehow
fixes the issue).

I am running a GENERIC kernel and have not set any loader tunables or
sysctls other than those related to addressing this issue (SATA power
management, AHCI, etc).

The problem first started around the time when we set up pool
scrubbing, and at that time it was a single instance which seemed to be
tied to the bad DIMM. Have not run pool scrubbing since that time.

Will get the output of gstat -a and post it here.

Will upgrade to FreeBSD 9.1RC2 today and compile a kernel with the
options you suggested.

I already went ahead and removed the L2ARC and one of the OS SSD drives
to simplify things - now I have 1 x SSD with OS and 1 x SSD for swap
and that is it.

I ran the Crucial firmware update ISO and it did not see any firmware
updates as necessary on the SSD disks.

I appreciate the feedback, as part of the difficulty here has been
determining whether this is software/driver or hardware. If software, I
agree that it would not make sense for this to suddenly pop up after
months of operation with no issues.

> Are you running a GENERIC kernel? If not, what changes have you made?
> Have you set any loader tunables or sysctls?
> Have you scrubbed the pools?
> If you run "gstat -a", do any devices have anomalous readings?
>
> I can't offer any definite fixes but can suggest a few more things to
> try:
> 1) Try FreeBSD-9.1RC2 and see if the problem persists.
> 2) Try a new kernel with
>      options WITNESS
>      options WITNESS_SKIPSPIN
>    this may make a software bug more obvious (but will somewhat increase
>    kernel overheads)
> 3) If you can afford it, detach the L2ARC - which removes one potential issue.
> 4) If you haven't already, build a kernel with
>      makeoptions DEBUG=-g
>      options KDB
>      options KDB_TRACE
>      options KDB_UNATTENDED
>      options DDB
>    this won't have any impact on normal operation but will simplify debugging.
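
For reference, the options suggested in items 2 and 4 above could be
collected into one custom kernel configuration file. This is only a
sketch under stated assumptions: the name DEBUGKERN is a hypothetical
placeholder, and the file is assumed to sit next to GENERIC in the
conf directory for this machine's architecture (for example
/usr/src/sys/amd64/conf on amd64).

```
# DEBUGKERN (hypothetical name): GENERIC plus the suggested debug options
include GENERIC
ident   DEBUGKERN

# Build with debugging symbols
makeoptions DEBUG=-g

# Kernel debugger support, with a stack trace printed on panic
options KDB
options KDB_TRACE
# Do not stop in the debugger on panic by default, so an
# unattended box can still reboot
options KDB_UNATTENDED
options DDB

# Lock-order checking; adds some kernel overhead
options WITNESS
options WITNESS_SKIPSPIN
```

Assuming the source tree is in /usr/src, a config like this would
normally be built with "make buildkernel KERNCONF=DEBUGKERN" and
installed with "make installkernel KERNCONF=DEBUGKERN", followed by a
reboot.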