Date: Mon, 23 Jul 2012 14:08:12 +0100
From: "Clayton Milos" <clay@milos.co.za>
To: <freebsd-stable@freebsd.org>
Subject: ZFS causing panic
Message-ID: <00f701cd68d4$4a5dd030$df197090$@milos.co.za>
Hi guys,

I've had an issue for some time now. When I'm copying a lot of files over to ZFS, usually via SMB, it causes a panic and locks up the server.

I'm running FreeBSD 9.0-RELEASE with a custom kernel. I've just pulled unnecessary drivers out of the config and added:

cpu             HAMMER
device          pf
device          pflog
options         DEVICE_POLLING
options         HZ=1000

For full disclosure, I am getting these errors in the syslog, which means there's an ECC error occurring somewhere that I am trying to locate. I have replaced both of the CPUs and all of the RAM and am still getting it, so perhaps the north bridge has bought the farm. I don't think this is the issue, though, because I was getting panics before on other hardware.

Current hardware is an 80G OS drive, 2x Opteron 285s and 16G (8x2G) of RAM on a Tyan 2892 motherboard. The RAID card is an Areca 1120.

I am running 2 pools, both of them 4-drive hardware RAID5. The one I'm having issues with is 4x3TB drives seen as a 9TB SCSI drive:

da0 at arcmsr0 bus 0 scbus6 target 0 lun 0
da0: <Areca HOMER R001> Fixed Direct Access SCSI-5 device
da0: 166.666MB/s transfers (83.333MHz, offset 32, 16bit)
da0: Command Queueing enabled
da0: 8583068MB (17578123776 512 byte sectors: 255H 63S/T 1094187C)

This is encrypted with GELI to make /dev/da0.eli, upon which the pool is created.

It looks like it's lost the pool now since the last panic:

  pool: homer
 state: FAULTED
status: The pool metadata is corrupted and the pool cannot be opened.
action: Destroy and re-create the pool from a backup source.
   see: http://www.sun.com/msg/ZFS-8000-72
  scan: scrub repaired 0 in 7h0m with 0 errors on Mon Jul 23 05:25:27 2012
config:

        NAME        STATE     READ WRITE CKSUM
        homer       FAULTED      0     0     2
          da0.eli   ONLINE       0     0     8

Also, I was running a script to check the kernel memory every 2 seconds. It appears that it was well within the 1G I have assigned it in /boot/loader.conf:

TOTAL=695217852, 663.011 MB
TOTAL=695217852, 663.011 MB
TOTAL=695217852, 663.011 MB
TOTAL=695219900, 663.013 MB
TOTAL=695219900, 663.013 MB
TOTAL=695345852, 663.133 MB
TOTAL=695412412, 663.197 MB
TOTAL=695228092, 663.021 MB
TOTAL=695228092, 663.021 MB
TOTAL=695226044, 663.019 MB

My /boot/loader.conf contains:

ng_bpf_load="YES"
amdtemp_load="YES"
ubsec_load="YES"
vm.kmem_size="1024M"
vm.kmem_size_max="1024M"
vfs.zfs.arc_max="600M"
vfs.zfs.vdev.cache.size="8M"
vfs.zfs.txg.timeout="5"
kern.maxvnodes="250000"

This system is a home server, so I can run a debug kernel if required and crash it again.

My first question is: am I doing something wrong? I think the values I've put in are sufficient, but I could well have got them wrong.

The server is also not writing the crash dump out, by the looks of it. It hung at 1% and I had to power cycle it.

This is the panic:

panic: solaris assert: 0 == zap_increment_int(os, (-1ULL), user, delta, tx) (0x0 == 0x7a), file: /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c, line: 1224
cpuid = 3
KDB: stack backtrace:
#0 0xffffffff8055b74e at kdb_backtrace+0x5e
#1 0xffffffff80525c47 at panic+0x187
#2 0xffffffff80e71b9d at do_userquota_update+0xad
#3 0xffffffff80e71dae at dmu_objset_do_userquota_updates+0x1de
#4 0xffffffff80e882af at dsl_pool_sync+0x11f
#5 0xffffffff80e976e4 at spa_sync+0x334
#6 0xffffffff80ea7ed3 at txg_sync_thread+0x253
#7 0xffffffff804f89ee at fork_exit+0x11e
#8 0xffffffff8075847e at fork_trampoline+0xe
Uptime: 14h31m10s
Dumping 2489 out of 16370 MB:..1%

Thanks for any help.

//Clay
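
P.S. A few extra details in case they help with reproducing this. The encrypted pool was set up roughly as follows; this is from memory, so the exact geli(8) flags (sector size, keyfile) may not match what I actually used at the time:

    # initialise and attach the GELI provider on the Areca array
    geli init /dev/da0
    geli attach /dev/da0
    # create the pool directly on the encrypted provider
    zpool create homer /dev/da0.eli

In other words the pool sits on a single da0.eli vdev, with the redundancy handled by the Areca RAID5 underneath it.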
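
The TOTAL= lines above come from a quick-and-dirty loop that sums the MemUse column of vmstat -m every 2 seconds. Reconstructed from memory (the real script may also have added in the vmstat -z zone totals), it is roughly:

    #!/bin/sh
    # every 2 seconds, sum the MemUse column of vmstat -m and print the
    # total in bytes and in MB (this is what produces the TOTAL= lines)
    while :; do
        vmstat -m | awk '
            NR > 1 {
                # MemUse is the first field of the form "<digits>K" on each line
                for (i = 1; i <= NF; i++)
                    if ($i ~ /^[0-9]+K$/) { sub(/K$/, "", $i); kb += $i; break }
            }
            END { printf "TOTAL=%d, %.3f MB\n", kb * 1024, kb / 1024 }
        '
        sleep 2
    done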
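
And on the crash dump that never gets written out: as far as I can tell the dump device is configured the usual way, with something like the following in /etc/rc.conf (again listing it in case I've got that part wrong):

    dumpdev="AUTO"          # dump to the swap device picked at boot
    dumpdir="/var/crash"    # where savecore(8) writes the dump after reboot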