From: Harold Paulson <haroldp@internal.org>
To: freebsd-fs@freebsd.org
Date: Sat, 22 Oct 2011 23:54:20 -0700
Subject: Re: Damaged directory on ZFS
In-Reply-To: <20111018005448.GA2855@icarus.home.lan>

Jeremy,

If I've taken a while to respond it was because there was a ton of
great information in your post and I've spent a lot of time testing
stuff out.

On Oct 17, 2011, at 5:54 PM, Jeremy Chadwick wrote:

> On Mon, Oct 17, 2011 at 05:17:31PM -0700, Harold Paulson wrote:
>> I've had a server that boots from ZFS panicking for a couple days.
>> I have worked around the problem for now, but I hope someone can
>> give me some insight into what's going on, and how I can solve it
>> properly.
>>
>> The server is running 8.2-STABLE (zfs v28) with 8G of ram and 4 SATA
>> disks in a raid10 type arrangement:
>>
>> # uname -a
>> FreeBSD jane.sierraweb.com 8.2-STABLE-201105 FreeBSD 8.2-STABLE-201105 #0: Tue May 17 05:18:48 UTC 2011 root@mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC amd64
>
> First thing to do is to consider upgrading to a newer RELENG_8 date.
> There have been *many* ZFS fixes since May.

I've done this, ran a scrub that completed without error, and still,
listing that directory panics the machine.

>> It started panicking under load a couple days ago. We replaced RAM
>> and motherboard, but problems persisted. I don't know if a hardware
>> issue originally caused the problem or what. When it panics, I get
>> the usual panic message, but I don't get a core file, and it never
>> reboots itself.
>>
>> http://pastebin.com/F1J2AjSF
>
> ZFS developers will need to comment on the state of the backtrace. You
> may be requested to examine the core using kgdb and be given some
> commands to run on it.

Yeah, I made a real effort to get a core, but I just don't think it's
going to happen. It's an all-zfs system for starters. I actually
pulled a drive out of the array and reformatted it to try to get a
core, but it freezes on panic and never reboots after 15 seconds or
any of that.
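(For anyone who finds this thread later: the standard recipe for
getting a crash dump is, as far as I understand it, something along
these lines; the device name ada4 below is only for illustration, so
substitute whatever your scratch disk shows up as:

  # gpart create -s gpt ada4
  # gpart add -t freebsd-swap ada4
  # dumpon /dev/ada4p1

plus dumpdev="/dev/ada4p1" and dumpdir="/var/crash" in /etc/rc.conf so
that savecore can write the dump out on the next boot, after which you
can poke at it with:

  # kgdb /boot/kernel/kernel /var/crash/vmcore.0

The catch here is that this box never finishes the panic/reboot cycle,
so savecore never gets a chance to run.)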
>> While I was trying to figure out the source of the problem, I
>> noticed various stuck processes that peg a CPU and can't be killed,
>> such as:
>>
>>   PID JID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME    WCPU COMMAND
>> 48735   0 root        1  46    0 11972K   924K CPU3   3 415:14 100.00% find
>
> Had you done procstat -k -k 48735 (the "double -k" is not a typo), you
> probably would have seen that the process was "stuck" in a ZFS-related
> thread. These are processes which the kernel is hanging on to and will
> not let go of, so even kill -9 won't kill these.
>
> It would also have been worthwhile to get the "process tree" of what
> spawned the PID. (Solaris has ptree; I think we have something similar
> under FreeBSD but I forget what.) The reason that matters is that it's
> probably a periodic job that runs (there are many which use find),
> traversing your ZFS filesystems, and tickling a bug/issue somewhere.
> You even hint at this in your next paragraph, re: locate.updatedb.

The processes are just ones that touch that poison directory (or some
file within it): "pop3d", or "find" from the nightly periodic runs.
pstree is in ports and an old favorite of mine, and reports what I'd
expect from those.

procstat isn't any more interesting. Here was the one I managed to get:

# procstat -k -k 44571
  PID    TID COMM             TDNAME           KSTACK
44571 101006 find             -

>> I can move that directory out of the way, and carry on, but is there
>> anything I can do to really *repair* the problem?
>
> I would recommend starting with "zpool scrub" on the pool which is
> associated with the Maildir/ directory of the account you disabled. I
> will not be surprised if it comes back 100% clean.

Yep, scrubs complete without error.

> Given what the backtrace looks like, I would say the Maildir/ has a ton
> of files in it. Is that the case? Does "echo *" say something about
> argument list too long?

Nah, it's only like 12M of email (restored from a snap). Listing the
dir is an insta-panic.

> However, someone familiar with the ZFS internals, as I said, should
> investigate the crash you're experiencing regardless.

I'd still like to find a fix. I moved the dir to /var/blackhole and
excluded it from locate.updatedb and other periodic scans, so the
system isn't panicking, but it's a crummy situation.

 - H
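P.S. In case it helps anyone else in the same boat: one way to keep
locate.updatedb away from a directory like this is to add it to
PRUNEPATHS in /etc/locate.rc, something like the following (your
existing list will likely differ; just append the quarantined path to
whatever is already there):

  # /etc/locate.rc
  PRUNEPATHS="/tmp /usr/tmp /var/tmp /var/db/portsnap /var/blackhole"

and any hand-rolled find jobs can skip it with a prune expression:

  # find /var -path /var/blackhole -prune -o -type f -print

None of that repairs the underlying directory, obviously; it just
keeps the nightly jobs from tripping over it.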