Date: Tue, 15 Jul 2008 10:11:42 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200807151711.m6FHBgVO007481@apollo.backplane.com>
To: Steve Bertrand <steve@ibctech.ca>
References: <487CCD46.8080506@ibctech.ca>
Cc: freebsd-stable@freebsd.org
Subject: Re: taskqueue timeout


:Hi everyone,
:
:I'm wondering if the problems described in the following link have been 
:resolved:
:
:http://unix.derkeiler.com/Mailing-Lists/FreeBSD/stable/2008-02/msg00211.html
:
:I've got four 500GB SATA disks in a ZFS raidz pool, and all four of them 
:are experiencing the behavior.
:
:The problem only happens with extreme disk activity. The box becomes 
:unresponsive (can not SSH etc). Keyboard input is displayed on the 
:console, but the commands are not accepted.
:
:Is there anything I can do to either figure this out, or work around it?
:
:Steve

    If you are getting DMA timeouts, go to this URL:

    http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting

    Then I would suggest going into /usr/src/sys/dev/ata (I think, on
    FreeBSD), locate all instances where request->timeout is set to 5,
    and change them all to 10.

	cd /usr/src/sys/dev/ata
	fgrep 'request->timeout' *.c
	... change all assignments of 5 to 10 ...
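
    A minimal sketch of the edit step above, for anyone who would rather
    script it than edit by hand.  It assumes the assignments are written
    literally as "request->timeout = 5;", which may not hold for every
    release, so check the fgrep output first.  The function name
    bump_ata_timeout is made up for this example.

```sh
#!/bin/sh
# Sketch only: raise ATA request timeouts from 5 to 10 seconds.
# ASSUMPTION: the source uses the literal form "request->timeout = 5;".
# bump_ata_timeout is a hypothetical helper, not part of FreeBSD.

bump_ata_timeout() {
    dir="${1:-/usr/src/sys/dev/ata}"
    pat='request->timeout[[:space:]]*=[[:space:]]*5;'

    # Show what is about to change, with file names and line numbers.
    grep -n "$pat" "$dir"/*.c

    for f in "$dir"/*.c; do
        if grep -q "$pat" "$f"; then
            # Keep a pristine copy next to each file that gets edited.
            cp "$f" "$f.orig"
            # \1 is the text up to the value; the literal 5 becomes 10.
            sed 's/\(request->timeout[[:space:]]*=[[:space:]]*\)5;/\110;/' \
                "$f.orig" > "$f"
        fi
    done
}

# Example (run as a user who can write to the source tree):
#   bump_ata_timeout /usr/src/sys/dev/ata
```

    The change only takes effect once the kernel has been rebuilt and
    installed and the machine rebooted.  The .orig copies make it easy
    to back the edit out again.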

    Try that first.  If it helps then it is a known issue.  Basically
    a combination of the on-disk write cache and possible ECC corrections,
    remappings, or excessive remapped sectors can cause the drive to take
    much longer than normal to complete a request.  The default 5-second
    timeout is insufficient.

    If it does help, post confirmation to the list to prod the FreeBSD
    developers into changing the timeouts.

    --

    If you are NOT getting DMA timeouts then the ZFS lockups may be due
    to buffer/memory deadlocks.  ZFS has knobs for adjusting its memory
    footprint.  Lowering the footprint ought to solve (most of) those
    issues.  It's actually a fairly hard problem to solve.  Filesystems
    like UFS aren't complex enough to require the sort of dynamic memory
    allocations deep in the filesystem that ZFS and HAMMER need to do.
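
    As a rough illustration of those knobs: on FreeBSD they are normally
    boot-time tunables in /boot/loader.conf, such as vfs.zfs.arc_max and
    vm.kmem_size.  The sketch below sets one tunable in such a file,
    replacing any earlier setting of the same name.  The helper name
    set_loader_tunable and the values in the example are assumptions for
    illustration only, not recommendations; suitable values depend on
    how much RAM the box has.

```sh
#!/bin/sh
# Sketch only: set NAME="VALUE" in a loader.conf-style file.
# set_loader_tunable is a hypothetical helper, not a FreeBSD tool.

set_loader_tunable() {
    file="$1"; name="$2"; value="$3"

    touch "$file"
    # Drop any existing line that starts with NAME= (plain string match,
    # so the dots in tunable names are not treated as wildcards).
    awk -v n="$name" 'index($0, n "=") != 1' "$file" > "$file.tmp"
    # Append the new setting in loader.conf's name="value" form.
    printf '%s="%s"\n' "$name" "$value" >> "$file.tmp"
    mv "$file.tmp" "$file"
}

# Example with illustrative values (as root, then reboot):
#   set_loader_tunable /boot/loader.conf vm.kmem_size    512M
#   set_loader_tunable /boot/loader.conf vfs.zfs.arc_max 128M
```

    Loader tunables are read at boot, so a reboot is needed before the
    new limits apply.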

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>