Date: Sat, 22 Oct 2011 23:54:20 -0700
From: Harold Paulson <haroldp@internal.org>
To: freebsd-fs@freebsd.org
Subject: Re: Damaged directory on ZFS
Message-ID: <4E2EF065-5C7D-4C5A-B1ED-89FC4BBBEEA1@internal.org>
In-Reply-To: <20111018005448.GA2855@icarus.home.lan>
References: <4D8047A6-930E-4DE8-BA55-051890585BFE@internal.org> <20111018005448.GA2855@icarus.home.lan>

Jeremy,

If I've taken a while to respond it was because there was a ton of great information in your post and I've spent a lot of time testing stuff out.

On Oct 17, 2011, at 5:54 PM, Jeremy Chadwick wrote:

> On Mon, Oct 17, 2011 at 05:17:31PM -0700, Harold Paulson wrote:
>> I've had a server that boots from ZFS panicking for a couple days.  I have worked around the problem for now, but I hope someone can give me some insight into what's going on, and how I can solve it properly.
>>
>> The server is running 8.2-STABLE (zfs v28) with 8G of ram and 4 SATA disks in a raid10 type arrangement:
>>
>> # uname -a
>> FreeBSD jane.sierraweb.com 8.2-STABLE-201105 FreeBSD 8.2-STABLE-201105 #0: Tue May 17 05:18:48 UTC 2011     root@mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  amd64
>
> First thing to do is to consider upgrading to a newer RELENG_8 date.
> There have been *many* ZFS fixes since May.

I've done this, ran a scrub that completed without error, and still, listing that directory panics the machine.

>> It started panicking under load a couple days ago.  We replaced RAM and motherboard, but problems persisted.  I don't know if a hardware issue originally caused the problem or what.  When it panics, I get the usual panic message, but I don't get a core file, and it never reboots itself.
>>
>> http://pastebin.com/F1J2AjSF
>
> ZFS developers will need to comment on the state of the backtrace.  You
> may be requested to examine the core using kgdb and be given some
> commands to run on it.

Yeah, I made a real effort to get a core, but I just don't think it's going to happen.  It's an all-zfs system for starters.  I actually pulled a drive out of the array and reformatted it to try to get a core, but it freezes on panic and never reboots after 15 seconds or any of that.

>> While I was trying to figure out the source of the problem, I noticed various stuck processes that peg a CPU and can't be killed, such as:
>>
>>   PID JID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
>> 48735   0 root        1  46    0 11972K   924K CPU3    3 415:14 100.00% find
>
> Had you done procstat -k -k 48735 (the "double -k" is not a typo), you
> probably would have seen that the process was "stuck" in a ZFS-related
> thread.  These are processes which the kernel is hanging on to and will
> not let go of, so even kill -9 won't kill these.
>
> It would have also been worthwhile to get the "process tree" of what
> spawned the PID.  (Solaris has ptree; I think we have something similar
> under FreeBSD but I forget what)  The reason that matters is that it's
> probably a periodic job that runs (there are many which use find),
> traversing your ZFS filesystems, and tickling a bug/issue somewhere.
> You even hint at this in your next paragraph, re: locate.updatedb.

The processes are just ones that touch that poison directory (or some file within it), "pop3d" or "find" from nightly periodic runs.  pstree is in ports and an old favorite of mine, and reports what I'd expect from those.

procstat isn't any more interesting.  Here was the one I managed to get:

# procstat -k -k 44571
  PID    TID COMM             TDNAME           KSTACK
44571 101006 find             -                <running>

>> I can move that directory out of the way, and carry on, but is there anything I can do to really *repair* the problem?
>
> I would recommend starting with "zpool scrub" on the pool which is
> associated with the Maildir/ directory of the account you disable.  I
> will not be surprised if it comes back 100% clean.

Yep, scrubs complete without error.

> Given what the backtrace looks like, I would say the Maildir/ has a ton
> of files in it.  Is that the case?  Does "echo *" say something about
> argument list too long?

Nah, it's only like 12M of email (restored from a snap).  Listing the dir is an insta-panic.

> However, someone familiar with the ZFS internals, as I said, should
> investigate the crash you're experiencing regardless.

I'd still like to find a fix.  I moved the dir to /var/blackhole and excepted it from locate.updatedb and other periodic scans, so the system isn't panicking, but it's a crummy situation.

- H
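
On the "process tree" question quoted above, here is a minimal sketch (not part of the original exchange) of walking from a stuck PID up to whatever spawned it, using only a ps listing. The function name `ancestry` is made up for illustration, and it assumes the listing carries PID, PPID and command name as its first three columns, which `ps -axo pid,ppid,comm` should produce.

```shell
#!/bin/sh
# ancestry PID
# Reads a ps-style listing on stdin (header line, then "PID PPID COMMAND"
# rows) and prints the chain from PID up to its oldest listed ancestor,
# one "pid command" pair per line, child first.
# "ancestry" is a hypothetical helper name, not an existing tool.
ancestry() {
    awk -v pid="$1" '
        NR > 1 { parent[$1] = $2; name[$1] = $3 }   # skip the header row
        END {
            while (pid in parent) {
                print pid, name[pid]
                if (parent[pid] == pid) break       # self-parented entry, stop
                pid = parent[pid]
            }
        }'
}

# Intended use against the live process table:
#   ps -axo pid,ppid,comm | ancestry 48735
```

Only the first word of the command name is kept, which is enough to see whether the chain runs back through cron and a periodic script, as suspected in the thread, or through something else.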