From owner-freebsd-fs@FreeBSD.ORG  Tue Oct 18 00:36:03 2011
From: Harold Paulson <haroldp@internal.org>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: quoted-printable
Date: Mon, 17 Oct 2011 17:17:31 -0700
Message-Id: <4D8047A6-930E-4DE8-BA55-051890585BFE@internal.org>
To: freebsd-fs@freebsd.org
Mime-Version: 1.0 (Apple Message framework v1084)
X-Mailer: Apple Mail (2.1084)
Subject: Damaged directory on ZFS

Hello,

I've had a server that boots from ZFS panicking for a couple of days.
I have worked around the problem for now, but I hope someone can give
me some insight into what's going on and how I can solve it properly.

The server is running 8.2-STABLE (ZFS v28) with 8 GB of RAM and 4 SATA
disks in a RAID10-style arrangement (two striped mirrors):

# uname -a
FreeBSD jane.sierraweb.com 8.2-STABLE-201105 FreeBSD 8.2-STABLE-201105 #0: Tue May 17 05:18:48 UTC 2011     root@mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  amd64

And zpool status:

	NAME           STATE     READ WRITE CKSUM
	tank           ONLINE       0     0     0
	  mirror       ONLINE       0     0     0
	    gpt/disk0  ONLINE       0     0     0
	    gpt/disk1  ONLINE       0     0     0
	  mirror       ONLINE       0     0     0
	    gpt/disk2  ONLINE       0     0     0
	    gpt/disk3  ONLINE       0     0     0

It started panicking under load a couple of days ago.  We replaced the
RAM and motherboard, but the problems persisted.  I don't know whether
a hardware issue originally caused the problem.  When it panics, I get
the usual panic message, but I don't get a core file, and it never
reboots itself.

http://pastebin.com/F1J2AjSF

While I was trying to figure out the source of the problem, I noticed
various stuck processes that peg a CPU and can't be killed, such as:

  PID JID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
48735   0 root        1  46    0 11972K   924K CPU3    3 415:14 100.00% find

They are not marked zombie, but I can't kill them, and even restarting
the jail they are in won't get rid of them.  truss just hangs on them
with no output.  On different occasions, I noticed pop3d processes for
the same user getting stuck in this way.  On a hunch I ran a "find"
through the files in the user's Maildir and got a panic.  I disabled
this account and now the server is stable again.  At least until
locate.updatedb walks through that directory, I suppose.  Evidently,
there is some kind of hole in the file system below that directory tree
causing the panic.
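
One way to narrow down which entry under that tree triggers the panic
is to write each path to stable storage before touching it, so that
the last line of the log after a crash names the suspect.  What follows
is only a minimal Python sketch of that idea, not a tested recovery
tool.  It assumes the log file lives on a filesystem outside the
damaged tree; probe_tree and _record are hypothetical names made up for
illustration.

```python
import os
import stat


def _record(log, line):
    """Write one line and force it to disk before returning."""
    log.write(line + "\n")
    log.flush()
    os.fsync(log.fileno())  # the line must survive a panic on the next call


def probe_tree(root, log):
    """Walk root, logging every path *before* it is listed or stat'ed.

    Returns the number of entries that were stat'ed successfully.
    Uses lstat so symbolic links are never followed.
    """
    stack = [root]
    visited = 0
    while stack:
        directory = stack.pop()
        _record(log, "DIR  " + directory)
        try:
            names = sorted(os.listdir(directory))
        except OSError as err:
            _record(log, "ERR  %s: %s" % (directory, err))
            continue
        for name in names:
            path = os.path.join(directory, name)
            _record(log, "STAT " + path)
            try:
                info = os.lstat(path)
            except OSError as err:
                _record(log, "ERR  %s: %s" % (path, err))
                continue
            visited += 1
            if stat.S_ISDIR(info.st_mode):
                stack.append(path)
    return visited
```

It would be called with the Maildir path and an open log file, for
example probe_tree("/path/to/Maildir", open("/some/other/fs/probe.log",
"w")), where both paths are placeholders.  The fsync on every entry
makes the walk slow, which seems an acceptable trade for knowing where
it stopped.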

I can move that directory out of the way and carry on, but is there
anything I can do to really *repair* the problem?

Thanks.

	- H