From: Andrey Kosachenko
Date: Tue, 01 Nov 2011 00:07:50 +0200
To: Harold Paulson
Cc: freebsd-fs@freebsd.org, pjd@FreeBSD.org
Subject: Re: Damaged directory on ZFS
Message-ID: <4EAF1C36.9010209@gmail.com>
In-Reply-To: <0D7D4701-925D-4BFC-A2BE-51892CD08B45@internal.org>

Hi,

On 30.10.2011 08:46, Harold Paulson wrote:
> Pawel,
>
> On Oct 23, 2011, at 7:02 AM, Pawel Jakub Dawidek wrote:
>
>> On Mon, Oct 17, 2011 at 05:17:31PM -0700, Harold Paulson wrote:
>>> Hello,
>>>
>>> I've had a server that boots from ZFS panicking for a couple days. I have worked around the problem for now, but I hope someone can give me some insight into what's going on, and how I can solve it properly.
>>>
>>> The server is running 8.2-STABLE (zfs v28) with 8G of ram and 4 SATA disks in a raid10 type arrangement:
>>>
>>> # uname -a
>>> FreeBSD jane.sierraweb.com 8.2-STABLE-201105 FreeBSD 8.2-STABLE-201105 #0: Tue May 17 05:18:48 UTC 2011 root@mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC amd64
>>>
>>> And zpool status:
>>>
>>>         NAME           STATE     READ WRITE CKSUM
>>>         tank           ONLINE       0     0     0
>>>           mirror       ONLINE       0     0     0
>>>             gpt/disk0  ONLINE       0     0     0
>>>             gpt/disk1  ONLINE       0     0     0
>>>           mirror       ONLINE       0     0     0
>>>             gpt/disk2  ONLINE       0     0     0
>>>             gpt/disk3  ONLINE       0     0     0
>>>
>>> It started panicking under load a couple days ago. We replaced RAM and motherboard, but problems persisted. I don't know if a hardware issue originally caused the problem or what. When it panics, I get the usual panic message, but I don't get a core file, and it never reboots itself.
>>>
>>> http://pastebin.com/F1J2AjSF
>>>
>>> While I was trying to figure out the source of the problem, I noticed various stuck processes that peg a CPU and can't be killed, such as:
>>>
>>>   PID JID USERNAME THR PRI NICE   SIZE    RES STATE  C   TIME    WCPU COMMAND
>>> 48735   0 root       1  46    0 11972K   924K CPU3   3 415:14 100.00% find
>>>
>>> They are not marked zombie, but I can't kill them, and restarting the jail they are in won't even get rid of them. truss just hangs with no output on them.
>>> On different occasions, I noticed pop3d processes for the same user getting stuck in this way. On a hunch I ran a "find" through the files in the user's Maildir and got a panic. I disabled this account and now the server is stable again. At least until locate.updatedb walks through that directory, I suppose. Evidently, there is some kind of hole in the file system below that directory tree causing the panic.
>>>
>>> I can move that directory out of the way, and carry on, but is there anything I can do to really *repair* the problem?

I see the same problem here (I'm running a CURRENT system dated Sun Oct 16 14:53:49 EEST 2011). Any file command (ls, find, etc.) run on the affected directory (in my case /usr/local/include/dirac) hangs, and kill -9 doesn't help. My system doesn't panic, though.

>> Could you run these commands:
>>
>> objdump -D /boot/kernel/zfs.ko.symbols | egrep '^[0-9a-f]{8,16}' | awk '{printf("0x%s\n", $1)}' | xargs -J ADDR printf "%u + %u\n" ADDR 0x111 | bc | xargs printf "0x%x\n" | xargs addr2line -e /boot/kernel/zfs.ko.symbols
>>
>> They should convert fzap_cursor_retrieve+0x111 into file:line. Send it
>> here once you obtain it.
>
> % objdump -D /boot/kernel/zfs.ko.symbols | egrep '^[0-9a-f]{8,16}' | awk '{printf("0x%s\n", $1)}' | xargs -J ADDR printf "%u + %u\n" ADDR 0x111 | bc | xargs printf "0x%x\n" | xargs addr2line -e /boot/kernel/zfs.ko.symbols
> /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zap.c:1158

The output of the suggested command is the same as Harold's, i.e.:

objdump -D /boot/kernel/zfs.ko.symbols | egrep '^[0-9a-f]{8,16} ' | awk '{printf("0x%s\n", $1)}' | xargs -J ADDR printf "%u + %u\n" ADDR 0x111 | bc | xargs printf "0x%x\n" | xargs addr2line -e /boot/kernel/zfs.ko.symbols
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zap.c:1158

-- 
WBR,
Andrey Kosachenko
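
As a rough sketch of what that pipeline does: it finds the symbol's start address in the disassembly, adds the 0x111 offset from the panic backtrace, and hands the sum to addr2line. The egrep pattern as archived no longer seems to carry a symbol filter, so the version below matches on the symbol label explicitly. The helper names (sym_plus_offset, resolve_line) are made up for illustration, and matching on `<symbol>:` assumes the usual objdump label format; treat it as a sketch, not a tested replacement for the command above.

```shell
#!/bin/sh
# Add an offset to a symbol's base address, like the "bc" step in the pipeline.
# $1: base address in hex without the 0x prefix (as objdump prints it)
# $2: offset, e.g. 0x111
sym_plus_offset() {
    printf '0x%x\n' "$(( 0x$1 + $2 ))"
}

# Hypothetical helper: resolve SYMBOL+OFFSET in MODULE to file:line.
# $1: module with debug symbols, $2: symbol name, $3: offset
resolve_line() {
    # objdump prints each function label as "<address> <symbol>:",
    # so the first field of the matching line is the base address.
    base=$(objdump -d "$1" | awk -v sym="<$2>:" '$2 == sym { print $1; exit }')
    if [ -z "$base" ]; then
        echo "symbol $2 not found in $1" >&2
        return 1
    fi
    addr2line -e "$1" "$(sym_plus_offset "$base" "$3")"
}
```

For the case in this thread it would be called as `resolve_line /boot/kernel/zfs.ko.symbols fzap_cursor_retrieve 0x111`.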