Date:      Sat, 22 Oct 2011 23:54:20 -0700
From:      Harold Paulson <haroldp@internal.org>
To:        freebsd-fs@freebsd.org
Subject:   Re: Damaged directory on ZFS
Message-ID:  <4E2EF065-5C7D-4C5A-B1ED-89FC4BBBEEA1@internal.org>
In-Reply-To: <20111018005448.GA2855@icarus.home.lan>
References:  <4D8047A6-930E-4DE8-BA55-051890585BFE@internal.org> <20111018005448.GA2855@icarus.home.lan>

Jeremy,

If I've taken a while to respond, it's because there was a ton of great
information in your post and I've spent a lot of time testing things out.


On Oct 17, 2011, at 5:54 PM, Jeremy Chadwick wrote:

> On Mon, Oct 17, 2011 at 05:17:31PM -0700, Harold Paulson wrote:
>> I've had a server that boots from ZFS panicking for a couple of days.  I
>> have worked around the problem for now, but I hope someone can give me
>> some insight into what's going on, and how I can solve it properly.
>>
>> The server is running 8.2-STABLE (ZFS v28) with 8G of RAM and 4 SATA
>> disks in a raid10-type arrangement:
>>
>> # uname -a
>> FreeBSD jane.sierraweb.com 8.2-STABLE-201105 FreeBSD 8.2-STABLE-201105 #0: Tue May 17 05:18:48 UTC 2011     root@mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  amd64
>
> First thing to do is to consider upgrading to a newer RELENG_8 date.
> There have been *many* ZFS fixes since May.

I've done this, run a scrub that completed without error, and still,
listing that directory panics the machine.
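
For reference, this is roughly what I ran after the upgrade; "tank" here
stands in for my pool name:

  zpool scrub tank
  zpool status -v tank    # poll until the scrub completes; -v would list
                          # any files with errors, and it reported none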


>> It started panicking under load a couple of days ago.  We replaced RAM
>> and motherboard, but problems persisted.  I don't know if a hardware
>> issue originally caused the problem or what.  When it panics, I get the
>> usual panic message, but I don't get a core file, and it never reboots
>> itself.
>>
>> http://pastebin.com/F1J2AjSF
>
> ZFS developers will need to comment on the state of the backtrace.  You
> may be requested to examine the core using kgdb and be given some
> commands to run on it.

Yeah, I made a real effort to get a core, but I just don't think it's
going to happen.  It's an all-ZFS system, for starters.  I actually
pulled a drive out of the array and reformatted it to try to get a core,
but the machine freezes on panic and never gets to the automatic reboot
after 15 seconds, or any of that.
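
For the archives, this is the sort of dump setup I was attempting; the
partition name is a placeholder for the swap slice I put on the pulled
drive:

  # point the kernel at a raw swap partition big enough for 8G of RAM:
  dumpon /dev/ada4p2
  # make it stick across reboots, in /etc/rc.conf:
  #   dumpdev="/dev/ada4p2"
  # savecore(8) runs from rc at the next boot and copies the dump
  # into /var/crash

But the box wedges hard at the panic prompt, so nothing ever gets written.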


>> While I was trying to figure out the source of the problem, I noticed
>> various stuck processes that peg a CPU and can't be killed, such as:
>>
>>  PID JID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
>> 48735   0 root        1  46    0 11972K   924K CPU3    3 415:14 100.00% find
>
> Had you done procstat -k -k 48735 (the "double -k" is not a typo), you
> probably would have seen that the process was "stuck" in a ZFS-related
> thread.  These are processes which the kernel is hanging on to and will
> not let go of, so even kill -9 won't kill these.
>
> It would have also been worthwhile to get the "process tree" of what
> spawned the PID.  (Solaris has ptree; I think we have something similar
> under FreeBSD but I forget what.)  The reason that matters is that it's
> probably a periodic job that runs (there are many which use find),
> traversing your ZFS filesystems, and tickling a bug/issue somewhere.
> You even hint at this in your next paragraph, re: locate.updatedb.

The processes are just the ones that touch that poison directory (or
some file within it): "pop3d", or "find" from the nightly periodic runs.
pstree is in ports and an old favorite of mine, and it reports what I'd
expect from those.
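
(Without pstree, you can also walk the tree by hand with ps; the PID
here is made up:)

  pid=48735
  while [ "$pid" -gt 1 ]; do
      ps -o pid=,ppid=,command= -p "$pid"    # print this process
      pid=$(ps -o ppid= -p "$pid" | tr -d ' ')   # then climb to its parent
  done

That climbs from the stuck process up toward init, one parent at a time.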

procstat isn't any more interesting.  Here's the one I managed to get:

# procstat -k -k 44571
  PID    TID COMM             TDNAME           KSTACK
44571 101006 find             -                <running>


>> I can move that directory out of the way and carry on, but is there
>> anything I can do to really *repair* the problem?
>
> I would recommend starting with "zpool scrub" on the pool which is
> associated with the Maildir/ directory of the account you disable.  I
> will not be surprised if it comes back 100% clean.

Yep, scrubs complete without error.


> Given what the backtrace looks like, I would say the Maildir/ has a ton
> of files in it.  Is that the case?  Does "echo *" say something about
> argument list too long?

Nah, it's only about 12M of email (restored from a snapshot).  Listing
the dir is an insta-panic.
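
If any of the ZFS folks want to poke at the on-disk state, my
understanding is that zdb can dump the directory's object without going
through the VFS (so no panic); something along these lines, where the
dataset name and object number are placeholders, since I can't "ls -i"
the thing to find the real number:

  # dump object metadata and directory (ZAP) entries for one object:
  zdb -dddd tank/mail 12345
  # or walk every object in the dataset to hunt for it:
  zdb -dddd tank/mail | less

Happy to run whatever and post the output.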


> However, someone familiar with the ZFS internals, as I said, should
> investigate the crash you're experiencing regardless.

I'd still like to find a real fix.  I moved the dir to /var/blackhole
and excluded it from locate.updatedb and the other periodic scans, so
the system isn't panicking anymore, but it's a crummy situation.
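
In case it saves someone else the digging, the exclusions amount to
something like this (the paths are mine, and the periodic knob name is
from memory, so check periodic.conf(5)):

  # /etc/locate.rc -- keep the weekly updatedb out of the bad directory:
  PRUNEPATHS="/tmp /usr/tmp /var/tmp /var/blackhole"

  # /etc/periodic.conf -- skip the find-based nightly setuid sweep:
  daily_status_security_chksetuid_enable="NO"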

	- H
