From: Kostik Belousov
To: Carl
Cc: freebsd-fs@freebsd.org
Date: Sun, 22 Feb 2009 13:00:52 +0200
Subject: Re: UFS2 and/or sparse file bug causing copy process to land in 'D'' state?
Message-ID: <20090222110052.GH41617@deviant.kiev.zoral.com.ua>
In-Reply-To: <49A10626.8060705@telus.net>

On Sun, Feb 22, 2009 at 12:00:38AM -0800, Carl wrote:
> I've come across what I'm thinking may be a bug in the context of
> FreeBSD 7.0 with a pair of gmirrored drives and gjournaled partitions
> when copying a large number of files into a file-backed memory device.
>
> The consequence of this problem is that a process enters the 'D' state
> (process in disk) indefinitely, cannot be killed, and the system cannot
> be shut down. The only solution is to cold reboot the system, which is a
> really big problem for remote systems. This is happening to me
> intermittently with the standard tar-tar pipeline form of copying, but
> has happened with the rsync 3.0.4 port as well.
>
> I would appreciate it if some of you would see if you can repeat this
> problem.
> Here is a sequence of tcsh shell commands which manifest the
> problem (on occasion but not every time), which I will refer to as the
> "truncate sequence" (depends on fully populated /usr/src tree as data set):
>
> # truncate -s 671088640 target
> # mdconfig -f target -S 512 -y 255 -x 63 -u 7
> # bsdlabel -w /dev/md7 auto
> # newfs -O2 -m 0 -o space /dev/md7a
> # mount /dev/md7a /media
> # tar -cvf - -C /usr/src . | tar -xvpof - -C /media
> # umount /media ; mdconfig -d -u 7 ; rm target
>
> An alternate version has yet to fail for me and involves replacing the
> first line with this one:
>
> # dd if=/dev/zero of=target bs=1M count=640
>
> I'll call that the "dd sequence". Here is an ordered series of tests I
> just completed:
>
> a) Repeated truncate sequence 7 times - 1st, 5th, and 7th failed.
> b) Repeated dd sequence 7 times - no failures.
> c) Repeated truncate sequence 6 times - no failures.
> d) Used following sequence to ensure all disk caches flushed:
>
> # dd if=/dev/random of=target bs=1M count=4096
> # dd if=target of=/dev/null bs=1M
> # rm target
>
> e) Repeated truncate sequence 4 times - no failures.
> f) Performed orderly reboot.
> g) Repeated truncate sequence 2 times - 2nd failed.
> h) Performed orderly reboot.
> i) Repeated dd sequence 7 times - no failures.
>
> All failures involve the second tar in the pipeline hanging in the 'D'
> state. In each case I do a cold reboot before proceeding with the next test.
>
> It's tempting to speculate that a bug exists in code related to handling
> sparse files specifically, but perhaps it just raises the probability of
> tripping a bug that would eventually manifest in the dd sequence as
> well. OTOH, I don't know how to rule out a physical disk or disk
> firmware problem.
>
> This problem has occurred with different data sets and different sized
> memory disks, but only with the source and destination filesystems being
> UFS2. I have done similar sequences with EXT2 and FAT16 destinations
> with no failures thus far, but the memory disks and data sets were
> smaller so it's conceivable that probability worked against me.
>
> I should note that the drives are Seagate ST31000340AS Barracudas, but
> both drives have been upgraded to firmware version SD1A and are
> therefore supposedly free of the infamous little horror Seagate
> inflicted on so many of us. smartctl tells me that both disks still have
> a raw value of 0 for Reallocated_Sector_Ct and both pass the "short"
> self test.

Please see
http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
for instructions on how to gather the required information to diagnose
the issue.
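
As a first step before the full procedure on that page, it may help to
record which process is stuck and on which wait channel. The following
is only a minimal sketch, not part of the handbook instructions: the
function name list_dstate is hypothetical, and it assumes the input
columns are PID, state, wait channel, and command, in that order.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread or the handbook): print the
# header line plus every process whose state begins with "D"
# (uninterruptible wait).
# Assumption: stdin carries ps output with the columns
# PID, STATE, WCHAN, COMMAND, in that order.
list_dstate() {
    awk 'NR == 1 { print; next }   # keep the column header
         $2 ~ /^D/ { print }       # state column starts with D'
}
```

On the affected machine this could be fed with something like
`ps -ax -o pid,state,wchan,command | list_dstate` while the second tar
is hung. The wait channel it shows is only a hint; the lock and
backtrace information described in the handbook is still what is
needed to diagnose the hang.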