From owner-freebsd-current@FreeBSD.ORG  Sun Sep  7 23:27:06 2003
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP
	id 098AB16A4BF; Sun,  7 Sep 2003 23:27:06 -0700 (PDT)
Received: from spider.deepcore.dk
	(cpe.atm2-0-56339.0x50c6aa0a.abnxx2.customer.tele.dk [80.198.170.10])
	by mx1.FreeBSD.org (Postfix) with ESMTP
	id 182AC44001; Sun,  7 Sep 2003 23:27:04 -0700 (PDT)
	(envelope-from sos@spider.deepcore.dk)
Received: from spider.deepcore.dk (localhost [127.0.0.1])
	by spider.deepcore.dk (8.12.9/8.12.9) with ESMTP id h886R0Io041448;
	Mon, 8 Sep 2003 08:27:00 +0200 (CEST)
	(envelope-from sos@spider.deepcore.dk)
Received: (from sos@localhost)
	by spider.deepcore.dk (8.12.9/8.12.9/Submit) id h886Qxw8041447;
	Mon, 8 Sep 2003 08:26:59 +0200 (CEST)
From: Soren Schmidt <sos@spider.deepcore.dk>
Message-Id: <200309080626.h886Qxw8041447@spider.deepcore.dk>
In-Reply-To: <20030908025121.GQ560@gelatinous.com>
To: Aaron Smith <aaron@mutex.org>
Date: Mon, 8 Sep 2003 08:26:59 +0200 (CEST)
X-Mailer: ELM [version 2.4ME+ PL99f (25)]
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=ISO-8859-1
X-mail-scanned: by DeepCore Virus & Spam killer v1.3
cc: freebsd-current@FreeBSD.ORG
cc: sos@FreeBSD.ORG
Subject: Re: pst driver: timeout explosion? (patch is attached)
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 08 Sep 2003 06:27:06 -0000

It seems Aaron Smith wrote:
> Hi,
> 
> I think I may have found the cause of the pst timeout panics.  I'm using
> the Promise SX6000 RAID on -CURRENT, using the pst driver.  Unfortunately,
> under sufficiently high I/O load, the box starts printing:
> 
>   "pst: timeout mfa=0x00327b90 cmd=0x01"
> 
> The 'mfa' address varies. It starts printing more and more rapidly, and
> then eventually the machine wedges solid. Sometimes it makes it to:
> 
>   "panic: timeout table full"
> 
> Here's what I think is happening. Two timeouts are being scheduled every
> time a timeout triggers, because pst_timeout schedules a timeout before
> calling pst_rw to retry the operation. Then pst_rw schedules ANOTHER
> timeout.  Both of these timeouts call pst_timeout, so they double every 10
> seconds until there are a large enough number of timeouts firing, retrying
> the same I/O operation, that the table fills and the machine panics.
> 
> Check out the following diff
> 
>   http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/pst/pst-raid.c.diff?r1=1.8&r2=1.9&f=h
> 
> This is where pst_rw was changed to schedule its own timeouts, but the
> timeout function didn't have its own removed.
> 
> Do you think this could be the correct explanation? It seems like once
> pst_timeout is called, the machine is doomed... I'm recompiling my kernel
> now to test the fix under load.

Yes, correct, there is a double timeout call in case of a timeout.
This explains why it goes down burning, but it still does not explain
why we get the first timeout, which I've been hunting for ages.
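
To make the doubling concrete, here is a rough toy model in C. It is not
the pst-raid.c code; the function name pending_after and its parameters
are made up for illustration. It only counts how many timeouts would be
pending for one stuck request, assuming every pending timeout fires each
round and the retry never completes.

```c
#include <stdio.h>

/*
 * Hypothetical model, NOT the real pst driver.
 * Returns the number of pending timeouts after `rounds` consecutive
 * expiries of a single request that never completes.
 *
 * handler_rearms != 0: the timeout handler arms a timeout itself and
 *                      then calls the retry path, which arms another
 *                      (the double scheduling described above).
 * handler_rearms == 0: only the retry path arms a timeout.
 */
static unsigned long
pending_after(unsigned rounds, int handler_rearms)
{
    unsigned long pending = 1;  /* armed when the request was first issued */
    unsigned long per_fire = handler_rearms ? 2 : 1;
    unsigned i;

    for (i = 0; i < rounds; i++) {
        /* each fired timeout is replaced by `per_fire` new ones */
        pending *= per_fire;
    }
    return pending;
}

/* Print both cases side by side for rounds 0..max_rounds. */
static void
print_growth(unsigned max_rounds)
{
    unsigned r;

    for (r = 0; r <= max_rounds; r++)
        printf("round %2u: double-armed=%lu single-armed=%lu\n",
            r, pending_after(r, 1), pending_after(r, 0));
}
```

In this model the double-armed case grows as 2^rounds while the
single-armed case stays at one pending timeout, which matches the
"doubling every 10 seconds until the table fills" behaviour described
in the report.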

I'll commit the fix right away for the double timeout call, thanks!!

-Søren