FreeBSD Mail Archives

Date:      Sat, 29 Oct 2011 23:46:15 -0700
From:      Harold Paulson <haroldp@internal.org>
To:        freebsd-fs@freebsd.org
Subject:   Re: Damaged directory on ZFS
Message-ID:  <0D7D4701-925D-4BFC-A2BE-51892CD08B45@internal.org>
In-Reply-To: <20111023140222.GG1697@garage.freebsd.pl>
References:  <4D8047A6-930E-4DE8-BA55-051890585BFE@internal.org> <20111023140222.GG1697@garage.freebsd.pl>

Pawel,

On Oct 23, 2011, at 7:02 AM, Pawel Jakub Dawidek wrote:

> On Mon, Oct 17, 2011 at 05:17:31PM -0700, Harold Paulson wrote:
>> Hello,
>>
>> I've had a server that boots from ZFS panicking for a couple days.  I have worked around the problem for now, but I hope someone can give me some insight into what's going on, and how I can solve it properly.
>>
>> The server is running 8.2-STABLE (zfs v28) with 8G of ram and 4 SATA disks in a raid10 type arrangement:
>>
>> # uname -a
>> FreeBSD jane.sierraweb.com 8.2-STABLE-201105 FreeBSD 8.2-STABLE-201105 #0: Tue May 17 05:18:48 UTC 2011     root@mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  amd64
>>
>> And zpool status:
>>
>> 	NAME           STATE     READ WRITE CKSUM
>> 	tank           ONLINE       0     0     0
>> 	  mirror       ONLINE       0     0     0
>> 	    gpt/disk0  ONLINE       0     0     0
>> 	    gpt/disk1  ONLINE       0     0     0
>> 	  mirror       ONLINE       0     0     0
>> 	    gpt/disk2  ONLINE       0     0     0
>> 	    gpt/disk3  ONLINE       0     0     0
>>
>> It started panicking under load a couple days ago.  We replaced RAM and motherboard, but problems persisted.  I don't know if a hardware issue originally caused the problem or what.  When it panics, I get the usual panic message, but I don't get a core file, and it never reboots itself.
>>
>> http://pastebin.com/F1J2AjSF
>>
>> While I was trying to figure out the source of the problem, I noticed various stuck processes that peg a CPU and can't be killed, such as:
>>
>>  PID JID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
>> 48735   0 root        1  46    0 11972K   924K CPU3    3 415:14 100.00% find
>>
>> They are not marked zombie, but I can't kill them, and restarting the jail they are in won't even get rid of them.  truss just hangs with no output on them.  On different occasions, I noticed pop3d processes for the same user getting stuck in this way.  On a hunch I ran a "find" through the files in the user's Maildir and got a panic.  I disabled this account and now the server is stable again.  At least until locate.updatedb walks through that directory, I suppose.   Evidently, there is some kind of hole in the file system below that directory tree causing the panic.
>>
>> I can move that directory out of the way, and carry on, but is there anything I can do to really *repair* the problem?
>
> Could you run these commands:
>
> 	objdump -D /boot/kernel/zfs.ko.symbols | egrep '^[0-9a-f]{8,16} <fzap_cursor_retrieve>' | awk '{printf("0x%s\n", $1)}' | xargs -J ADDR printf "%u + %u\n" ADDR 0x111 | bc | xargs printf "0x%x\n" | xargs addr2line -e /boot/kernel/zfs.ko.symbols
>
> They should convert fzap_cursor_retrieve+0x111 into file:line. Send it
> here once you obtain it.

% objdump -D /boot/kernel/zfs.ko.symbols | egrep '^[0-9a-f]{8,16} <fzap_cursor_retrieve>' | awk '{printf("0x%s\n", $1)}' | xargs -J ADDR printf "%u + %u\n" ADDR 0x111 | bc | xargs printf "0x%x\n" | xargs addr2line -e /boot/kernel/zfs.ko.symbols
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zap.c:1158
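
For reference, the one-liner boils down to three steps: take the start address of fzap_cursor_retrieve from the objdump output, add the 0x111 offset, and hand the sum to addr2line. A minimal sketch of the arithmetic step alone follows. The function name sym_plus_offset and the base address 0x1000 are made up for illustration and are not taken from this system.

```shell
#!/bin/sh
# Add a hex offset to a hex base address and print the sum in hex.
# Both arguments are expected in 0x... form.
sym_plus_offset() {
    printf '0x%x\n' "$(( $1 + $2 ))"
}

# Hypothetical base address, for illustration only; the real value is
# the first column of the matching objdump line.
sym_plus_offset 0x1000 0x111
```

The printed address is what would then go to addr2line -e /boot/kernel/zfs.ko.symbols, as in the pipeline above.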

	- H

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?0D7D4701-925D-4BFC-A2BE-51892CD08B45>
