From owner-freebsd-fs@FreeBSD.ORG Sun May 18 07:11:47 2008
Return-Path:
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
    by hub.freebsd.org (Postfix) with ESMTP id D861C1065677
    for ; Sun, 18 May 2008 07:11:47 +0000 (UTC)
    (envelope-from andrew@thefrog.net)
Received: from rv-out-0506.google.com (rv-out-0506.google.com [209.85.198.232])
    by mx1.freebsd.org (Postfix) with ESMTP id BB6218FC16
    for ; Sun, 18 May 2008 07:11:47 +0000 (UTC)
    (envelope-from andrew@thefrog.net)
Received: by rv-out-0506.google.com with SMTP id b25so912555rvf.43
    for ; Sun, 18 May 2008 00:11:47 -0700 (PDT)
Received: by 10.141.137.16 with SMTP id p16mr2872136rvn.192.1211094707041;
    Sun, 18 May 2008 00:11:47 -0700 (PDT)
Received: from qurbaga.plantsoft.org ( [121.44.4.97])
    by mx.google.com with ESMTPS id g31sm10158869rvb.2.2008.05.18.00.11.41
    (version=TLSv1/SSLv3 cipher=RC4-MD5);
    Sun, 18 May 2008 00:11:44 -0700 (PDT)
Message-Id: <683A6ED2-0E54-42D7-8212-898221C05150@thefrog.net>
From: Andrew Hill
To: freebsd-fs@freebsd.org
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
Content-Transfer-Encoding: 7bit
Mime-Version: 1.0 (Apple Message framework v919.2)
Date: Sun, 18 May 2008 17:11:37 +1000
X-Mailer: Apple Mail (2.919.2)
Sender: Andrew Hill
Subject: ZFS lockup in "zfs" state
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Sun, 18 May 2008 07:11:47 -0000

> The following patch, published some time ago by pjd helped me:
> http://mbsd.msk.ru/dist/zfs_lockup.diff
>
> 100+ days of uptime of heavily loaded machines and no problems so far.
>
> Hope it would help.

I applied this patch with some modifications to fix up the file names, as they seem to have moved from

- src/sys/contrib/opensolaris/uts/common/fs/zfs/sys/zil.h
- src/sys/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c
- src/sys/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c

to

- src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zil.h
- src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c
- src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c

(and pointed the kernel configuration file, MASSHOSTING_7_64, to my own kernel config)

buildworld and buildkernel succeeded without error, but when i installed the new kernel and rebooted i got the following output (the important point being the failure to load zfs on the 8th line):

May 17 17:02:06 <0.2> gutter kernel: Copyright (c) 1992-2008 The FreeBSD Project.
May 17 17:02:06 <0.2> gutter kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
May 17 17:02:06 <0.2> gutter kernel: The Regents of the University of California. All rights reserved.
May 17 17:02:06 <0.2> gutter kernel: FreeBSD is a registered trademark of The FreeBSD Foundation.
May 17 17:02:06 <0.2> gutter kernel: FreeBSD 7.0-STABLE #6: Sat May 17 16:39:32 EST 2008
May 17 17:02:06 <0.2> gutter kernel: root@gutter.thefrog.net:/usr/obj/usr/src/sys/GUTTER
May 17 17:02:06 <0.2> gutter kernel: link_elf_obj: symbol kproc_exit undefined
May 17 17:02:06 <0.2> gutter kernel: KLD file zfs.ko - could not finalize loading
May 17 17:02:06 <0.2> gutter kernel: Timecounter "i8254" frequency 1193182 Hz quality 0
May 17 17:02:06 <0.2> gutter kernel: CPU: AMD Athlon(tm) 64 Processor 3200+ (2010.31-MHz K8-class CPU)
May 17 17:02:06 <0.2> gutter kernel: Origin = "AuthenticAMD" Id = 0x10ff0 Stepping = 0
May 17 17:02:06 <0.2> gutter kernel: Features =0x78bfbff
May 17 17:02:06 <0.2> gutter kernel: AMD Features=0xe2500800
May 17 17:02:06 <0.2> gutter kernel: AMD Features2=0x1
May 17 17:02:06 <0.2> gutter kernel: usable memory = 2137882624 (2038 MB)
May 17 17:02:06 <0.2> gutter kernel: avail memory = 2060988416 (1965 MB)
May 17 17:02:06 <0.2> gutter kernel: ACPI APIC Table:
May 17 17:02:06 <0.2> gutter kernel: ioapic0 irqs 0-23 on motherboard
May 17 17:02:06 <0.2> gutter kernel: ad0: 238475MB at ata0-master UDMA100
May 17 17:02:06 <0.2> gutter kernel: ad2: 238475MB at ata1-master UDMA100
May 17 17:02:06 <0.2> gutter kernel: ad3: 152627MB at ata1-slave UDMA100
May 17 17:02:06 <0.2> gutter kernel: ad4: 476940MB at ata2-master SATA300
May 17 17:02:06 <0.2> gutter kernel: ad6: 715404MB at ata3-master SATA300
May 17 17:02:06 <0.2> gutter kernel: ad8: 305245MB at ata4-master SATA300
May 17 17:02:06 <0.2> gutter kernel: ad10: 305245MB at ata5-master SATA300
May 17 17:02:06 <0.2> gutter kernel: ad12: 305245MB at ata6-master SATA150
May 17 17:02:06 <0.2> gutter kernel: Trying to mount root from zfs:tank/root
May 17 17:02:06 <0.2> gutter kernel:
May 17 17:02:06 <0.2> gutter kernel: Manual root filesystem specification:
May 17 17:02:06 <0.2> gutter kernel: : Mount using filesystem
May 17 17:02:06 <0.2> gutter kernel: eg. ufs:da0s1a
May 17 17:02:06 <0.2> gutter kernel: ? List valid disk boot devices
May 17 17:02:06 <0.2> gutter kernel: Abort manual input
May 17 17:02:06 <0.2> gutter kernel:
May 17 17:02:06 <0.2> gutter kernel: mountroot>

at this point, since zfs had not been loaded, obviously i could not get it to mount root from zfs:tank/root, and resorted to a backup ufs root to put my old kernel back in place.

i'm not sure if there is more output available than just the "could not finalize loading". if so, please let me know where to look, and i'd love to re-test this patch if it'll provide more information.

right now, i'm getting uptimes on the order of days before everything locks up. i assume it's related to this bug, though i'm also getting the following output when it locks up:

ad2: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=350494631
ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=234920650
ad2: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=443427007
ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=350174938
ad2: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=350494631
ad0: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=234920650
ad2: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=443427007
ad0: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=350174938
ad2: FAILURE - WRITE_DMA48 timed out LBA=350494631
ad0: FAILURE - WRITE_DMA timed out LBA=234920650
ad2: FAILURE - WRITE_DMA48 timed out LBA=443427007
ad0: FAILURE - WRITE_DMA48 timed out LBA=350174938

typically repeated for a number of different LBA values before the system panics. I don't know if this is more likely to be related to the cause of the lockups (e.g. faulty hardware/driver) or if it's an effect of the lockup (e.g. waiting on a deadlocked thread)... from what i've found searching mailing lists, this kind of error seems to turn up with faulty hardware/drivers, so i guess it could just be that zfs exposes the faults because it's using the hardware differently to my previous ufs setup...
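
A minimal sketch, assuming Python 3 is available on some machine where the console output has been saved, for tallying the TIMEOUT/FAILURE messages above per device and LBA. The names `ATA_LINE`, `summarize` and `SAMPLE` are made up for illustration and are not part of any FreeBSD tool; the regular expression only covers the two message shapes quoted in this mail.

```python
import re
from collections import defaultdict

# Matches the two message shapes quoted above, e.g.
#   ad2: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=350494631
#   ad2: FAILURE - WRITE_DMA48 timed out LBA=350494631
ATA_LINE = re.compile(
    r"(?P<dev>ad\d+): (?P<kind>TIMEOUT|FAILURE) - (?P<cmd>\w+) .*?LBA=(?P<lba>\d+)"
)


def summarize(log_text):
    """Return {device: {lba: [event kinds in order of appearance]}}."""
    summary = defaultdict(lambda: defaultdict(list))
    for match in ATA_LINE.finditer(log_text):
        summary[match["dev"]][int(match["lba"])].append(match["kind"])
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {dev: dict(lbas) for dev, lbas in summary.items()}


# A few lines copied from the output above (hypothetical test input).
SAMPLE = """\
ad2: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=350494631
ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=234920650
ad2: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=350494631
ad2: FAILURE - WRITE_DMA48 timed out LBA=350494631
"""

if __name__ == "__main__":
    for dev, lbas in sorted(summarize(SAMPLE).items()):
        for lba, events in sorted(lbas.items()):
            print(dev, lba, events)
```

If the same few LBAs keep failing on one disk, that would hint at the disk itself; failures spread over many LBAs on several disks at once would fit a controller, cabling, driver or deadlock explanation better. This is only a way of looking at the log, not a diagnosis.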

in terms of my specific setup, i have 2GB of RAM, i'm running from up-to-date -STABLE source (apart from my attempt to apply the aforementioned patch), i'm running an amd64 kernel, and my /boot/loader.conf looks like this:

vm.kmem_size_max="1610612736"
vm.kmem_size="1610612736"
zfs_load="YES"
vfs.root.mountfrom="zfs:tank/root"
vfs.zfs.prefetch_disable="1"
vfs.zfs.arc_max="838860800"

the last line was an attempt to reduce the size of the ARC in the kernel in case it was having trouble locating memory blocks for other things (as the default value had it at 1.2GB), but adding that parameter doesn't seem to have had any effect.

anyway, any info toward resolving this would be greatly appreciated, and otherwise let me know what further info i can provide to help track down the problem.

Andrew
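
For readers checking the numbers, the byte values in the loader.conf above can be converted to MiB with a short sketch like the following. It assumes Python 3; `LOADER_CONF`, `to_mib` and `parse_sizes` are illustrative names only, and the parser handles just the simple `name="value"` lines shown in this mail, not the full loader.conf syntax.

```python
# The numeric tunables from the loader.conf quoted above.
LOADER_CONF = """\
vm.kmem_size_max="1610612736"
vm.kmem_size="1610612736"
vfs.zfs.arc_max="838860800"
"""


def to_mib(value):
    """Convert a byte count (string or int) to mebibytes."""
    return int(value) / (1024 * 1024)


def parse_sizes(text):
    """Return {tunable: size in MiB} for lines of the form name="digits"."""
    sizes = {}
    for line in text.splitlines():
        name, _, raw = line.partition("=")
        raw = raw.strip().strip('"')
        if raw.isdigit():
            sizes[name.strip()] = to_mib(raw)
    return sizes


if __name__ == "__main__":
    for name, mib in parse_sizes(LOADER_CONF).items():
        print(f"{name}: {mib:.0f} MiB")
```

With these values, vm.kmem_size works out to 1536 MiB (1.5 GiB of the 2GB installed) and vfs.zfs.arc_max to 800 MiB, so the ARC limit is a little over half of the kernel memory map.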