From owner-freebsd-current@FreeBSD.ORG  Tue Jun  1 07:33:29 2004
From: Doug Ambrisko <ambrisko@ambrisko.com>
Message-Id: <200406011433.i51EXQWN019665@ambrisko.com>
In-Reply-To: <40BC11FA.3050404@freebsd.org>
To: Scott Long <scottl@freebsd.org>
Date: Tue, 1 Jun 2004 07:33:26 -0700 (PDT)
X-Mailer: ELM [version 2.4ME+ PL94b (25)]
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=US-ASCII
cc: Tony Byrne <byrnehq@eircom.net>
cc: current@freebsd.org
Subject: Re: Lockups with Intel ICH5 SATA

Scott Long writes:
| Doug Ambrisko wrote:
| > Tony Byrne writes:
| > | In recent weeks my FreeBSD current box has been experiencing frequent
| > | hard lockups.  Seldom a day goes by without the machine freezing
| > | solid.  My hunch is that this is somehow related to the onboard Intel
| > | ICH5 controller and SATA HD, because during reboot after a lockup,
| > | the machine often complains of a DMA timeout that hangs the box
| > | while reading from the SATA drive.
| > 
| > FYI, if the drive has a media error or does a spin down/spin up sequence
| > things will hang since the ata driver currently doesn't deal with the
| > SATA PHY registers.  After a while of ignoring various issues flagged
| > in these bits your system will lock up solid on an inb/outb to the controller.
| 
| Can you explain this a bit more?  Is the driver ignoring the interrupt
| and thus allowing an interrupt storm?  Or is it ACK'ing the interrupt,
| but the ICH5 controller is expecting a certain further response that
| it's not getting?  Or is it masking the interrupt entirely which in turn
| exposes a flaw in the ICH5 hardware?

It's not an interrupt storm; the system locks up hard.  The Promise
and Intel SATA chips both do this.  I figured this out by instrumenting
the FreeBSD interrupt code to write to the port 0x80 diag port.  No
interrupts were going on.  I then went through a long process of
instrumenting the code to narrow down where the hang happens, and it
occurs on either an inb or an outb to the controller.  The chip never
returns.
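
A minimal sketch of the checkpoint technique described above, assuming
a hypothetical diag_port_write() hook.  In a real kernel the hook would
be the platform's byte write to I/O port 0x80; here it is a stub that
only records the value, so the sketch runs in userland.  The checkpoint
codes and fake_controller_access() are made up for illustration and are
not taken from the actual ata driver.

```c
#include <stdint.h>

#define DIAG_PORT 0x80              /* POST/diagnostic port mentioned above */

/* Stand-in for the value a POST card would display (assumption: in a
 * real kernel this would be an actual port write, not a variable). */
static uint8_t diag_last_code;

/* Hypothetical hook: replace the body with the real port write. */
static void diag_port_write(uint16_t port, uint8_t code)
{
    (void)port;
    diag_last_code = code;
}

/* Emit a code just before a step that might hang. */
#define DIAG_CHECKPOINT(code) diag_port_write(DIAG_PORT, (uint8_t)(code))

/* Hypothetical code path with one checkpoint per suspicious access.
 * hang_at_step simulates the chip never returning at that step. */
static int fake_controller_access(int hang_at_step)
{
    DIAG_CHECKPOINT(0x10);          /* entered the path under test */
    if (hang_at_step == 1)
        return -1;
    DIAG_CHECKPOINT(0x11);          /* about to read from the controller */
    if (hang_at_step == 2)
        return -1;
    DIAG_CHECKPOINT(0x12);          /* about to write to the controller */
    if (hang_at_step == 3)
        return -1;
    DIAG_CHECKPOINT(0x1f);          /* path completed */
    return 0;
}
```

After a hard lockup, the last code left on the diagnostic display shows
which access never returned.  Adding checkpoints between the last code
seen and the next expected one narrows the location step by step.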

On the Intel SATA controller, if you read the SATA registers described
in the SATA spec, you can see the errors being reported for the SATA
PHY.  If they are acknowledged then the above inb/outb hangs go away.
The Promise card is different since it doesn't expose the SATA
registers.  On the Promise card the "hot plug" status needs to be
monitored to see the disk go away and then come back.  If this is
ignored and writes are still done you can get the hang.  Also, there
are 2 resets.  You need to do the one that does the full reset of the
HW channel to recover from some media errors etc.
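
A minimal sketch of what "acknowledging" the PHY errors could look
like, assuming the SError register behaves as write-1-to-clear (which is
how the SATA spec describes it, as far as I understand it).  The
register is simulated with a struct so the sketch runs in userland; how
a given controller actually exposes the register is chip specific and
not shown.  All names here are hypothetical.

```c
#include <stdint.h>

/* Simulated port with only an SError register (assumption: real
 * hardware access would go through the controller's own mechanism). */
struct fake_sata_port {
    uint32_t serror;
};

static uint32_t serror_read(const struct fake_sata_port *p)
{
    return p->serror;
}

/* Write-1-to-clear: every bit set in v clears the matching error bit. */
static void serror_write(struct fake_sata_port *p, uint32_t v)
{
    p->serror &= ~v;
}

/* Read the pending PHY error bits and acknowledge them by writing the
 * same bits back.  Returns the bits that were pending, 0 if none. */
static uint32_t sata_phy_ack_errors(struct fake_sata_port *p)
{
    uint32_t err = serror_read(p);

    if (err != 0)
        serror_write(p, err);
    return err;
}
```

A driver could call something like sata_phy_ack_errors() from its error
and timeout paths and log the returned bits, rather than leaving them
set until the controller stops responding.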

In some ways freezing the system makes sense.  The SATA part of the
controller has an error.  The OS is ignoring the problem, so after
a while it freezes the system.  Here is the part in the SATA spec:
   Error responses are generally classified into four categories
        Freeze
        Abort
        Retry
        Track/ignore
     [snip]
   For the most severe error conditions in which state has been critically 
   perturbed in a way that it is not recoverable, the appropriate error 
   response is to freeze and rely on a reset or similar operation to 
   restore all necessary state to return to normal operation.

The side effect of a freeze is that once you get into this error state,
a normal ATA-only style controller access locks the system :-(  I guess
you can say this is a bug in the Promise & Intel chips, since HW
shouldn't be able to lock up the system.

These issues are addressed in my -stable patches, in which I brought over
the SATA support from -current, and in my minimal patches to add this
HW support to -current.  My HW patches to -current don't address other
issues I've run into.  The -stable patches are being used on systems
shipping to customers and by some other people.  A side effect is that
hot swap becomes really easy to deal with, since you know when the
drive comes and goes, and SATA hot swap bays are cheap and easy to come
by :-)
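
A minimal sketch of tracking when the drive comes and goes, assuming
the driver gets a "device present" reading from the controller's
hot-plug status.  It holds back writes while the disk is gone and does
a full channel reset when it returns.  The state names and functions
are hypothetical, and chan_full_reset() only counts calls in place of
the real hardware reset.

```c
#include <stdbool.h>

enum chan_state {
    CHAN_READY,         /* disk present, channel usable */
    CHAN_GONE,          /* disk went away, do not touch it */
    CHAN_NEEDS_RESET    /* disk came back, reset before use */
};

struct chan {
    enum chan_state state;
    int full_resets;    /* number of full channel resets performed */
};

/* Feed in the latest presence reading from the hot-plug status. */
static void chan_presence_event(struct chan *c, bool present)
{
    if (!present)
        c->state = CHAN_GONE;
    else if (c->state == CHAN_GONE)
        c->state = CHAN_NEEDS_RESET;
}

/* Hypothetical stand-in for the full reset of the hardware channel. */
static void chan_full_reset(struct chan *c)
{
    c->full_resets++;
    c->state = CHAN_READY;
}

/* Returns true if a write may be issued now, false if it must wait. */
static bool chan_may_write(struct chan *c)
{
    if (c->state == CHAN_NEEDS_RESET)
        chan_full_reset(c);
    return c->state == CHAN_READY;
}
```

With this kind of gate in front of the I/O path, a pulled drive results
in deferred or failed requests instead of writes to a channel that is
no longer responding.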

Now I'm not saying there are no other cases in which people have hit
interrupt storms.  All I know is that with my test HW it isn't that;
it's just the SATA controller trying to tell the OS about a problem and
being ignored, resulting in a system lockup.  This will happen on media
errors as well :-(

This isn't a problem with the Adaptec aac based SATA controller.  It just
works :-)

Doug A.