Date: Sun, 22 Feb 2009 00:00:38 -0800
From: Carl <k0802647@telus.net>
To: freebsd-fs@freebsd.org
Subject: UFS2 and/or sparse file bug causing copy process to land in 'D'' state?
Message-ID: <49A10626.8060705@telus.net>
I've come across what I think may be a bug in FreeBSD 7.0, on a system with a pair of gmirrored drives and gjournaled partitions, when copying a large number of files into a file-backed memory device. The consequence is that a process enters the 'D' state (uninterruptible disk wait) indefinitely and cannot be killed, and the system cannot be shut down. The only solution is to cold reboot the system, which is a really big problem for remote systems. This is happening to me intermittently with the standard tar-tar pipeline form of copying, but has happened with the rsync 3.0.4 port as well.

I would appreciate it if some of you would see if you can repeat this problem. Here is a sequence of tcsh shell commands which manifests the problem (on occasion, but not every time), which I will refer to as the "truncate sequence" (it depends on a fully populated /usr/src tree as the data set):

    # truncate -s 671088640 target
    # mdconfig -f target -S 512 -y 255 -x 63 -u 7
    # bsdlabel -w /dev/md7 auto
    # newfs -O2 -m 0 -o space /dev/md7a
    # mount /dev/md7a /media
    # tar -cvf - -C /usr/src . | tar -xvpof - -C /media
    # umount /media ; mdconfig -d -u 7 ; rm target

An alternate version has yet to fail for me and involves replacing the first line with this one:

    # dd if=/dev/zero of=target bs=1M count=640

I'll call that the "dd sequence".

Here is an ordered series of tests I just completed:

a) Repeated truncate sequence 7 times - 1st, 5th, and 7th failed.
b) Repeated dd sequence 7 times - no failures.
c) Repeated truncate sequence 6 times - no failures.
d) Used the following sequence to ensure all disk caches were flushed:
       # dd if=/dev/random of=target bs=1M count=4096
       # dd if=target of=/dev/null bs=1M
       # rm target
e) Repeated truncate sequence 4 times - no failures.
f) Performed orderly reboot.
g) Repeated truncate sequence 2 times - 2nd failed.
h) Performed orderly reboot.
i) Repeated dd sequence 7 times - no failures.

All failures involve the second tar in the pipeline hanging in the 'D' state.
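For anyone trying to reproduce this, it may help to confirm that the two ways of creating "target" really differ as expected: truncate should leave a sparse file, while dd writes every block. The following is only a sketch and was not part of the test runs described here. The scratch directory, the 8 MiB size, and the variable names are arbitrary choices made for illustration. It compares the apparent size of each file with the allocated size reported by du.

```sh
#!/bin/sh
# Sketch only: compare the on-disk allocation of a truncate-created
# (sparse) file with a dd-created (fully written) file of the same
# apparent size. The size and file names are illustrative assumptions.
set -e

workdir=$(mktemp -d "${TMPDIR:-/tmp}/sparsetest.XXXXXX")
size=8388608                      # 8 MiB, small enough to run quickly

# Sparse file: the length is set but no data blocks are written.
truncate -s "$size" "$workdir/sparse"

# Dense file: every block is written with zeros (128 x 64 KiB = 8 MiB).
dd if=/dev/zero of="$workdir/dense" bs=65536 count=128 2>/dev/null

# Apparent size in bytes (what ls -l would show).
sparse_bytes=$(wc -c < "$workdir/sparse" | tr -d ' ')
dense_bytes=$(wc -c < "$workdir/dense" | tr -d ' ')

# Allocated size in kilobytes (what the filesystem actually holds).
sparse_kb=$(du -k "$workdir/sparse" | awk '{print $1}')
dense_kb=$(du -k "$workdir/dense" | awk '{print $1}')

echo "sparse: $sparse_bytes bytes apparent, $sparse_kb KB allocated"
echo "dense:  $dense_bytes bytes apparent, $dense_kb KB allocated"

rm -rf "$workdir"
```

If the sparse file shows far fewer allocated kilobytes than the dense one, then the backing file in the truncate sequence starts out almost entirely unallocated, and its blocks presumably get allocated in the underlying filesystem only as the memory disk is written to. The exact numbers depend on the filesystem, so treat them as indicative only.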
After each failure I do a cold reboot before proceeding with the next test.

It's tempting to speculate that a bug exists in code related to handling sparse files specifically, but perhaps the sparse file just raises the probability of tripping a bug that would eventually manifest in the dd sequence as well. OTOH, I don't know how to rule out a physical disk or disk firmware problem.

This problem has occurred with different data sets and different sized memory disks, but only with the source and destination filesystems both being UFS2. I have done similar sequences with EXT2 and FAT16 destinations with no failures thus far, but the memory disks and data sets were smaller, so it's conceivable that probability simply worked against me.

I should note that the drives are Seagate ST31000340AS Barracudas, but both drives have been upgraded to firmware version SD1A and are therefore supposedly free of the infamous little horror Seagate inflicted on so many of us. smartctl tells me that both disks still have a raw value of 0 for Reallocated_Sector_Ct, and both pass the "short" self test.

Carl / K0802647