Date: Wed, 13 Dec 2006 17:28:39 +1100
From: "Jan Mikkelsen" <janm@transactionware.com>
To: <freebsd-stable@freebsd.org>
Subject: g_vfs_done() failures on 6.2-RC1
Message-ID: <00a601c71e7f$ed63f7a0$3301a8c0@janmxp>
Hi,

I have seen g_vfs_done() failures with absurd offsets under heavy I/O. Recovery doesn't seem to happen, so the system has to be rebooted. The problem seems to occur without any underlying disk device failure.

An example from yesterday. The following message repeated on the order of tens of thousands of times, with no earlier message:

g_vfs_done():da1s1d[READ(offset=5036583429229836288, length=16384)]error = 5

bsdlabel correctly reports that /dev/da1s1d has 1748318312 512-byte blocks, so the offset is clearly wrong.
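As a sanity check, here is the back-of-the-envelope arithmetic (just the numbers above; nothing taken from the GEOM or UFS sources):

    #include <stdio.h>
    #include <stdint.h>

    int
    main(void)
    {
            /* Partition size reported by bsdlabel: 1748318312 512-byte blocks. */
            uint64_t max_offset = 1748318312ULL * 512;      /* ~8.95e11 bytes */

            /* First offset reported by g_vfs_done() above. */
            uint64_t reported = 5036583429229836288ULL;     /* ~5.04e18 bytes */

            printf("partition size:  %ju bytes\n", (uintmax_t)max_offset);
            printf("reported offset: %ju bytes\n", (uintmax_t)reported);
            printf("ratio:           %ju\n", (uintmax_t)(reported / max_offset));
            return (0);
    }

That first offset is more than five million times the size of the partition, so these aren't just slightly out-of-range requests.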
The filesystem was using softupdates and was doing a few "rm -rf"s on two CVS repositories. After this error the rm didn't die from SIGTERM or SIGKILL. (Unfortunately, I didn't check wchan for the rm process. Sorry.)

The shutdown took three hours. I didn't have console access, so I don't know the console messages at the time. The machine did respond to pings during at least the first hour.

After it came back up, the filesystems were all reported as clean. Attempting to finish off the "rm" produced this result:

bad block 8819084429375818952, ino 92865791
pid 49 (softdepflush), uid 0 inumber 92865791 on /work: bad block
bad block -8123569960048088809, ino 92865791
pid 49 (softdepflush), uid 0 inumber 92865791 on /work: bad block
handle_workitem_freeblocks: block count
g_vfs_done():da1s1d[READ(offset=1154660658434844672, length=16384)]error = 5
bad block -9114721846648257515, ino 92865789
pid 49 (softdepflush), uid 0 inumber 92865789 on /work: bad block
g_vfs_done():da1s1d[READ(offset=8698001308483434496, length=16384)]error = 5
bad block -8102232258315484873, ino 92865789
pid 49 (softdepflush), uid 0 inumber 92865789 on /work: bad block
g_vfs_done():da1s1d[READ(offset=4586979512427630592, length=16384)]error = 5
bad block -3438510379221006390, ino 92865789
pid 49 (softdepflush), uid 0 inumber 92865789 on /work: bad block
g_vfs_done():da1s1d[READ(offset=196654394503331840, length=16384)]error = 5
g_vfs_done():da1s1d[READ(offset=26142581273591808, length=16384)]error = 5
bad block 504981533259792482, ino 92865789
pid 49 (softdepflush), uid 0 inumber 92865789 on /work: bad block
bad block 1538054898336656903, ino 92865789
pid 49 (softdepflush), uid 0 inumber 92865789 on /work: bad block
g_vfs_done():da1s1d[READ(offset=249387551018614784, length=16384)]error = 5
bad block 18582847101533720, ino 92865789
pid 49 (softdepflush), uid 0 inumber 92865789 on /work: bad block
g_vfs_done():da1s1d[READ(offset=259247319150690304, length=16384)]error = 5
bad block -3429473246997783577, ino 92865789
pid 49 (softdepflush), uid 0 inumber 92865789 on /work: bad block
bad block -3335830404336954747, ino 92865789
pid 49 (softdepflush), uid 0 inumber 92865789 on /work: bad block
bad block -1007814018434232494, ino 92865789
pid 49 (softdepflush), uid 0 inumber 92865789 on /work: bad block
handle_workitem_freeblocks: block count

A reboot to single user mode and an fsck cleaned things up.

In this case it is a machine running 6.2-RC1/amd64 with patches on a SuperMicro motherboard, 2 x Xeon 5140 CPUs, 4GB ECC memory and an Areca SATA RAID controller. The array is RAID-6, with write-through controller cache and drive write cache disabled. The controller reported no I/O errors, and no volumes are degraded.

I have also seen very similar problems on a dual-Opteron machine with ataraid (in that case, 6.1-RELEASE), again with no degraded volumes and no device I/O errors reported.

The patches:

- Daichi Goto's unionfs-p16 has been applied.
- The Areca driver is 1.20.00.12 from the Areca website.
- sym(4) patch (see PR/89550), but no sym controller present.
- SMP + FAST_IPSEC + SUIDDIR + device crypto.

So: I've seen this problem on a few machines under heavy I/O load, with ataraid and with arcmsr. I've seen others report similar problems, but I've seen no resolution.

Does anyone have any idea what the problem is? Has anyone else seen similar problems? Where to from here?

Thanks,

Jan Mikkelsen
janm@transactionware.com