Date: Mon, 21 Nov 2005 21:12:24 -0500
From: Kris Kennaway
To: Greg Rivers
Cc: freebsd-stable@freebsd.org
Subject: Re: Recurring problem: processes block accessing UFS file system
Message-ID: <20051122021224.GA12402@xor.obsecurity.org>
In-Reply-To: <20051121164139.T48994@w10.sac.fedex.com>

On Mon, Nov 21, 2005 at 05:54:09PM -0600, Greg Rivers wrote:
> I've recently put up three busy email relay hosts running 6.0-STABLE.
> Performance is excellent except for a nagging critical issue that keeps
> cropping up.
>
> /var/spool is its own file system mounted on a geom stripe of four BSD
> partitions (details below). Once every two or three days all the
> processes accessing /var/spool block forever in disk wait. All three
> machines suffer this problem. No diagnostic messages are generated and
> the machines continue running fine otherwise, but a reboot is required to
> clear the condition. This problem occurs during normal operation, but is
> particularly likely to occur during a backup when dump makes a snapshot.
>
> There doesn't appear to be a problem with gstripe, as gstripe status is
> "UP" and I can read the raw device just fine while processes continue to
> block on the file system. I tried running a kernel with WITNESS and
> DIAGNOSTIC, but these options shed no light.
>
> If I catch the problem early enough I can break successfully into kdb;
> otherwise, if too many processes stack up, the machine hangs going into
> kdb and must be power-cycled.

Make sure you have KDB_STOP_NMI in your kernel.

> I obtained the following process listing and traces from kdb. I traced
> mksnap_ffs which was blocked in "ufs", and two random sendmail processes
> that were blocked in "ufs" and "suspfs" respectively.

Looks like a UFS snapshot deadlock. Are you running something like
dump -L on this filesystem, or making other use of snapshots? fsck -B
also uses them, but shouldn't be running except at boot time.

You should take this up with Kirk McKusick - in the meantime you can
work around it by not making use of UFS snapshots.

Kris