From: Mikhail Teterin <Mikhail.Teterin@murex.com>
Organization: Murex N.A.
To: Matthew Dillon
cc: freebsd-current@freebsd.org
cc: bde@zeta.org.au
Date: Wed, 13 Oct 2004 18:37:51 -0400
Subject: Re: panic in ffs (Re: hangs in nbufkv)

=: I don't know how, but the bug seems to be triggered by upping
=: net.inet.udp.maxdgram from 9216 (the default) to 16384 (to match the
=: NFS client's wsize). Once I do that, the machine will either panic
=: or just hang a few minutes into the heavy NFS writing (Sybase
=: database dumps from a Solaris server). Happened twice already...

= Interesting. That's getting a bit outside the realm I can help
= with. NFS and the network stack have been issues in FreeBSD
= recently, so it's probably something related.

Actually, that's not it. Even if I don't touch any sysctls and simply
proceed to load the machine with our backup scripts, it will
eventually either hang (after many complaints about WRITE_DMA problems
on the disk the NFS clients write to) or panic with:

	initiate_write_inodeblock_ufs2: already started

(in /sys/ufs/ffs/ffs_softdep.c).

As for the WRITE_DMA problems: after going through two disks, two
cables, and two different on-board SATA connectors, we concluded that
the problem is in the ata driver (hence
http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/72451).

As for the panics: I set BKVASIZE back down to 16KB, rebuilt the
kernel, and recreated the filesystem that used to have the 64KB bsize.
The machine still either panics or hangs under load.

Maybe I should give a bit more details about the load. It is produced
by a script, which tells the Sybase server to dump one database at a
time over NFS to the "staging" disk (a single SATA150 drive) and, as
each database is dumped, compresses the dump onto the RAID5 array for
storage.
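To make that concrete, the script is roughly along the lines of the
sketch below. The database names, paths, and isql credentials are made
up for illustration; only the general shape matches the real thing:

	#!/bin/sh
	# Illustrative sketch only: database names, paths, and the
	# isql credentials are hypothetical.
	STAGING=/staging        # the single SATA150 "staging" disk (NFS-exported)
	ARCHIVE=/raid5/dumps    # filesystem on the RAID5 array

	for db in accounts trades positions; do
		# Ask the Sybase server on the Solaris host to dump the
		# database into the NFS-mounted staging area.
		printf 'dump database %s to "/nfs/staging/%s.dmp"\ngo\n' \
		    "$db" "$db" | isql -S SYBASE -U sa -P "$SA_PASSWORD"

		# As each dump completes, compress it onto the RAID5
		# array and drop the staging copy.
		gzip -c "$STAGING/$db.dmp" > "$ARCHIVE/$db.dmp.gz" &&
		    rm -f "$STAGING/$db.dmp"
	done

Nothing fancier than that: one dump at a time, then one gzip.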
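And in case it matters for the panics: the 64KB-bsize filesystem
mentioned above had been created with roughly the following newfs
parameters (the device name is made up), with BKVASIZE bumped to match
in the kernel config -- the setting I have since reverted to the 16KB
default:

	# kernel config, before reverting (value assumed to match bsize):
	#   options BKVASIZE=65536
	newfs -U -b 65536 -f 8192 /dev/da0s1e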
When the thing is working properly, the Sybase server writes at or
close to the wire speed (9-11MB/second). Unfortunately, the staging
disk soon starts throwing the above-mentioned WRITE_DMA errors.
Fortunately, those are usually recoverable. Unfortunately, the machine
eventually hangs anyway...

I changed the script to use the RAID5 partition as the staging area as
well (this is the filesystem that used to have the 64KB bsize and 8KB
fsize -- it is over 1TB large), and it seems to work for now, but the
throughput is much lower than it used to be (limited by the RAID
controller's I/O).

Another observation I can make is that bufdaemon often takes up 50-80%
of the CPU time (on a 2.2GHz Opteron!) while this script is running.
Not sure if that's normal or not.

	-mi