Date: Sun, 08 Aug 2010 11:57:30 +0200
From: Rene Ladan <rene@freebsd.org>
To: freebsd-current@freebsd.org
Subject: Re: Filesystem wedge, SUJ-related?
Message-ID: <4C5E7F8A.7030800@freebsd.org>
In-Reply-To: <alpine.LNX.2.00.1007181352110.19876@ury.york.ac.uk>
References: <alpine.LNX.2.00.1007171747130.19876@ury.york.ac.uk> <alpine.LNX.2.00.1007181352110.19876@ury.york.ac.uk>
On 18-07-2010 15:02, Gavin Atkinson wrote:
> On Sat, 17 Jul 2010, Gavin Atkinson wrote:
>> Semi-regularly (every two-three days) I'm seeing what appears to be some
>> sort of filesystem wedge. I usually see it initially with web browsers,
>> but it's possible that's only because it's what produces most disk
>> activity on this machine. I've seen it with both Opera and Firefox.
>>

I've been seeing this too. It still happens with kernel r211000.

>> What happens is that the process will just wedge. A "procstat -kk" on it
>> shows the following stack backtrace:
>>
>> 9012 100243 firefox-bin initial thread mi_switch+0x21d
>> sleepq_switch+0x123 sleepq_wait+0x4d _sleep+0x357 getdirtybuf+0x21e
>> flush_deplist+0x6f softdep_sync_metadata+0x153 ffs_syncvnode+0x213
>> ffs_fsync+0x43 fsync+0x148 syscallenter+0x1b5 syscall+0x4c
>> Xfast_syscall+0xe2
>

Firefox is the usual first sign:

acer % ps ax|grep firefox
82117 v0 T 16:24,08 /usr/local/lib/firefox3/firefox-bin
13416 3 S+ 0:00,00 grep firefox
acer % procstat -kk 82117
PID TID COMM TDNAME KSTACK
82117 100195 firefox-bin - mi_switch+0x219 thread_suspend_switch+0x103 thread_single+0x25c exit1+0x81 sigexit+0x84 cursig+0 ast+0x1aa doreti_ast+0x1f
82117 100221 firefox-bin initial thread mi_switch+0x219 sleepq_switch+0xfa sleepq_wait+0x46 _sleep+0x256 getdirtybuf+0x1af flush_deplist+0x6a softdep_sync_metadata+0x153 ffs_syncvnode+0x22d ffs_fsync+0x43 fsync+0x13d syscallenter+0x194 syscall+0x41 Xfast_syscall+0xe2
acer %

> A bit more detail: it does look like whatever is supposed to periodically
> flush the journal just stops doing it's job. Presumably this is also the
> root cause of the "softdep: Out of journal space!" messages I have been
> seeing in the past, which I had assumed may have been fixed by r209717.
>

I haven't seen any "softdep: Out of journal space!" messages since June 24,
but I've indeed seen it once before (somewhere after June 11).

> (I'm running r209723 at the moment)
>
> While processes are starting to hang, "sh ffs" from ddb shows:
>
> db> sh ffs
> mp 0xffffff0002c45be0 / devvp 0xffffff0002c51000 fs 0xffffff0002c67000 su_wl 0 su_wl_in 0 su_deps 0 su_req 0
> mp 0xffffff0002d705f0 /tmp devvp 0xffffff0002d48780 fs 0xffffff0002c64800 su_wl 0 su_wl_in 0 su_deps 0 su_req 0
> mp 0xffffff0002c458e8 /usr devvp 0xffffff0002d485a0 fs 0xffffff0002c66000 su_wl 0 su_wl_in 0 su_deps 17345 su_req 0
> mp 0xffffff0002c455f0 /var devvp 0xffffff0002d483c0 fs 0xffffff0002c66800 su_wl 0 su_wl_in 0 su_deps 55 su_req 0
>
> Leaving it another couple of hours, I then see:
>
> db> sh ffs
> mp 0xffffff0002c45be0 / devvp 0xffffff0002c51000 fs 0xffffff0002c67000 su_wl 0 su_wl_in 0 su_deps 0 su_req 0
> mp 0xffffff0002d705f0 /tmp devvp 0xffffff0002d48780 fs 0xffffff0002c64800 su_wl 0 su_wl_in 0 su_deps 36 su_req 0
> mp 0xffffff0002c458e8 /usr devvp 0xffffff0002d485a0 fs 0xffffff0002c66000 su_wl 0 su_wl_in 0 su_deps 31899 su_req 0
> mp 0xffffff0002c455f0 /var devvp 0xffffff0002d483c0 fs 0xffffff0002c66800 su_wl 0 su_wl_in 0 su_deps 95 su_req 0
>
> so, su_deps is increasing significantly.
>
> During reboot, vnlru failed to stop within 60 seconds, and gave up on
> syncing 125 vnodes and 140 buffers (no idea if these are related). On
> reboot, SU+J fsck shows for /usr:
>
> ** SU+J Recovering /dev/ad4s1f
> ** Reading 33554432 byte journal from inode 150.
> ** Building recovery table.
> ** Resolving unreferenced inode list.
> ** Processing journal entries.
> ** 405991 journal records in 18194944 bytes for 71.40% utilization
> ** Freed 3872 inodes (0 dirs) 48157 blocks, and 8744 frags.
>

Similar here.

> So it seems clear that somehow the journal is filling up, and never being
> written.
>
> Any other suggestions as to where I should go from here?
>

Disabling the journal would be a "solution", but not desirable. Maybe any
lock order reversals to look out for (most are ufs-related)?

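To put numbers on the quoted output, here is a minimal Python sketch that diffs su_deps between the two "sh ffs" snapshots and re-derives the fsck utilization figure. The names (`su_deps`, `growth`, `JREC_SIZE`) are made up for illustration, and the 32-byte journal record size is an assumption rather than something stated in this thread; it is used only because it reproduces the reported 71.40%.

```python
import re

# The two "sh ffs" snapshots quoted above, copied verbatim.
SNAPSHOT_1 = """
mp 0xffffff0002c45be0 / devvp 0xffffff0002c51000 fs 0xffffff0002c67000 su_wl 0 su_wl_in 0 su_deps 0 su_req 0
mp 0xffffff0002d705f0 /tmp devvp 0xffffff0002d48780 fs 0xffffff0002c64800 su_wl 0 su_wl_in 0 su_deps 0 su_req 0
mp 0xffffff0002c458e8 /usr devvp 0xffffff0002d485a0 fs 0xffffff0002c66000 su_wl 0 su_wl_in 0 su_deps 17345 su_req 0
mp 0xffffff0002c455f0 /var devvp 0xffffff0002d483c0 fs 0xffffff0002c66800 su_wl 0 su_wl_in 0 su_deps 55 su_req 0
"""

SNAPSHOT_2 = """
mp 0xffffff0002c45be0 / devvp 0xffffff0002c51000 fs 0xffffff0002c67000 su_wl 0 su_wl_in 0 su_deps 0 su_req 0
mp 0xffffff0002d705f0 /tmp devvp 0xffffff0002d48780 fs 0xffffff0002c64800 su_wl 0 su_wl_in 0 su_deps 36 su_req 0
mp 0xffffff0002c458e8 /usr devvp 0xffffff0002d485a0 fs 0xffffff0002c66000 su_wl 0 su_wl_in 0 su_deps 31899 su_req 0
mp 0xffffff0002c455f0 /var devvp 0xffffff0002d483c0 fs 0xffffff0002c66800 su_wl 0 su_wl_in 0 su_deps 95 su_req 0
"""

# Captures the mount point and the su_deps counter from one "sh ffs" line.
LINE = re.compile(r"^mp \S+ (\S+) devvp .*\bsu_deps (\d+)")


def su_deps(snapshot):
    """Return {mount point: su_deps} for one snapshot."""
    result = {}
    for line in snapshot.splitlines():
        match = LINE.match(line.strip())
        if match:
            result[match.group(1)] = int(match.group(2))
    return result


before = su_deps(SNAPSHOT_1)
after = su_deps(SNAPSHOT_2)
growth = {mp: after[mp] - before[mp] for mp in before}

# Figures from the fsck output quoted above.
JOURNAL_RECORDS = 405991
JOURNAL_BYTES_USED = 18194944
JREC_SIZE = 32  # assumption: bytes per journal record, not stated in the thread

utilization = 100.0 * JOURNAL_RECORDS * JREC_SIZE / JOURNAL_BYTES_USED

for mp, delta in growth.items():
    print(f"{mp}: su_deps {before[mp]} -> {after[mp]} (+{delta})")
print(f"journal utilization: {utilization:.2f}%")
```

On these figures /usr gains 14554 outstanding dependencies between the two snapshots, while / stays at zero, which matches the observation that su_deps only grows on the busy filesystems.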
I don't know if it is related, but yesterday a full fsck on /usr cleared up
two unallocated files in /usr/ports/editors/openoffice-3/work/ (they showed
up in userland as having a bad file descriptor), which the journal didn't
catch.

Regards,
Rene
--
http://www.rene-ladan.nl/
GPG fingerprint = ADBC ECCD EB5F A6B4 549F 600D 8C9E 647A E564 2BFC
(subkeys.pgp.net)