From: Doug Ambrisko
Message-Id: <200406011433.i51EXQWN019665@ambrisko.com>
In-Reply-To: <40BC11FA.3050404@freebsd.org>
To: Scott Long
Date: Tue, 1 Jun 2004 07:33:26 -0700 (PDT)
cc: Tony Byrne
cc: current@freebsd.org
Subject: Re: Lockups with Intel ICH5 SATA
List-Id: Discussions about the use of FreeBSD-current

Scott Long writes:
| Doug Ambrisko wrote:
| > Tony Byrne writes:
| > | In recent weeks my FreeBSD current box has been experiencing frequent
| > | hard lockups.  Seldom a day goes by without the machine freezing
| > | solid.
| > | My hunch is that this is somehow related to the onboard Intel
| > | ICH5 controller and SATA HD, because during reboot after a lockup,
| > | the machine often complains of a DMA timeout that hangs the box
| > | while reading from the SATA drive.
| >
| > FYI, if the drive has a media error or does a spin down/spin up sequence
| > things will hang since the ata driver currently doesn't deal with the
| > SATA PHY registers.  After a while of ignoring various issues flagged
| > in these bits your system will lock up solid on an inb/outb to the
| > controller.
|
| Can you explain this a bit more?  Is the driver ignoring the interrupt
| and thus allowing an interrupt storm?  Or is it ACK'ing the interrupt,
| but the ICH5 controller is expecting a certain further response that
| it's not getting?  Or is it masking the interrupt entirely which in turn
| exposes a flaw in the ICH5 hardware?

It's not an interrupt storm; the system locks up hard.  The Promise
and Intel SATA chips both do this.  I figured this out by instrumenting
the FreeBSD interrupt code to write to the port 0x80 diag port.  No
interrupts were going on.  I then went through a long process of
narrowing down where the hang happens, and it occurs on either an inb
or an outb to the controller.  The chip never returns.

On the Intel SATA chip, if you read the SATA registers described in the
SATA spec. you can see the errors being reported for the SATA PHY.  If
they are acknowledged then the above inb/outb hangs go away.  The
Promise card is different since it doesn't expose the SATA registers.
On the Promise card the "hot plug" status needs to be monitored to see
the disk go and then come back.  If this is ignored and writes are
still done you can get the hang.  Also, there are 2 resets.  You need
to do the one that does the full reset of the HW channel to recover
from some media errors etc.

In some ways freezing the system makes sense.  The SATA part of the
controller has an error.
The OS is ignoring the problem, so after a while it freezes the system.
Here is the relevant part of the SATA spec.:

  Error responses are generally classified into four categories
    Freeze
    Abort
    Retry
    Track/ignore
  [snip]
  For the most severe error conditions in which state has been
  critically perturbed in a way that it is not recoverable, the
  appropriate error response is to freeze and rely on a reset or
  similar operation to restore all necessary state to return to
  normal operation.

The side effect of a freeze is that once you get into this error state,
a normal ATA-only type of controller access locks the system :-(
I guess you can say this is a bug in the Promise & Intel chips, in that
HW shouldn't be able to lock up the system.

These issues are addressed in my -stable patches, in which I brought
over the SATA support from -current, and in my minimal patches to add
this HW support to -current.  My HW patches to -current don't address
other issues I've run into.  The -stable patches are being used on
systems shipping to customers and by some other people.  A side effect
is that hot swap becomes really easy to deal with, since you know when
the drive comes and goes, and SATA hot swap bays are cheap and easy to
come by :-)

Now I'm not saying there are not other cases in which people have hit
interrupt storms.  All I know is that with my test HW it isn't that;
it's just the SATA controller trying to tell the OS of a problem and
being ignored, resulting in a system lock up.  This will happen on
media errors as well :-(

This isn't a problem with the Adaptec aac based SATA controller.  It
just works :-)

Doug A.