Date:      Mon, 23 Jul 2012 14:08:12 +0100
From:      "Clayton Milos" <clay@milos.co.za>
To:        <freebsd-stable@freebsd.org>
Subject:   ZFS causing panic
Message-ID:  <00f701cd68d4$4a5dd030$df197090$@milos.co.za>

Hi guys

I've had an issue for some time now. When I'm copying a lot of files to ZFS,
usually over SMB, the server panics and locks up.
I'm running FreeBSD 9.0-RELEASE with a custom kernel. I've only pulled
unnecessary drivers out of the config and added:
cpu             HAMMER
device          pf
device          pflog
options         DEVICE_POLLING
options         HZ=1000

For full disclosure, I am getting ECC errors in the syslog, which means an
ECC error is occurring somewhere; I am still trying to locate it. I have
replaced both CPUs and all of the RAM and am still getting it, so perhaps
the north bridge has bought the farm. I don't think this is the issue,
though, because I was getting panics before on other hardware. The current
hardware is an 80 GB OS drive, two Opteron 285s and 16 GB (8x2 GB) of RAM on
a Tyan 2892 motherboard. The RAID card is an Areca 1120.
I am running two pools, both of them 4-drive hardware RAID5. The one I'm
having issues with is 4x3 TB drives, seen as a single 9 TB SCSI drive:
da0 at arcmsr0 bus 0 scbus6 target 0 lun 0
da0: <Areca HOMER R001> Fixed Direct Access SCSI-5 device 
da0: 166.666MB/s transfers (83.333MHz, offset 32, 16bit)
da0: Command Queueing enabled
da0: 8583068MB (17578123776 512 byte sectors: 255H 63S/T 1094187C)

The device is encrypted with GELI to make /dev/da0.eli, on which the pool is
created. It looks like the pool has been lost since the last panic:
  pool: homer
 state: FAULTED
status: The pool metadata is corrupted and the pool cannot be opened.
action: Destroy and re-create the pool from
        a backup source.
   see: http://www.sun.com/msg/ZFS-8000-72
 scan: scrub repaired 0 in 7h0m with 0 errors on Mon Jul 23 05:25:27 2012
config:

        NAME        STATE     READ WRITE CKSUM
        homer       FAULTED      0     0     2
          da0.eli   ONLINE       0     0     8
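For reference, the setup was roughly along these lines (a sketch only; the
key file path and GELI parameters here are illustrative, not my exact
commands):

    # initialize GELI on the RAID volume (key file and sector size assumed)
    geli init -s 4096 -K /root/da0.key /dev/da0
    # attach it, creating /dev/da0.eli
    geli attach -k /root/da0.key /dev/da0
    # single-vdev pool on the encrypted device
    zpool create homer /dev/da0.eli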

I was also running a script to check the kernel memory every 2 seconds. It
appears it stayed well within the 1 GB I have assigned it in
/boot/loader.conf:
TOTAL=695217852, 663.011 MB
TOTAL=695217852, 663.011 MB
TOTAL=695217852, 663.011 MB
TOTAL=695219900, 663.013 MB
TOTAL=695219900, 663.013 MB
TOTAL=695345852, 663.133 MB
TOTAL=695412412, 663.197 MB
TOTAL=695228092, 663.021 MB
TOTAL=695228092, 663.021 MB
TOTAL=695226044, 663.019 MB
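The script is essentially this loop (a sketch; my actual script differs in
detail, but it sums the MemUse column of vmstat -m, which is reported in KB):

    #!/bin/sh
    while :; do
        kb=$(vmstat -m | awk 'NR > 1 { sum += $3 } END { print sum }')
        total=$((kb * 1024))
        mb=$(echo "scale=3; ${total} / 1048576" | bc)
        printf 'TOTAL=%s, %s MB\n' "${total}" "${mb}"
        sleep 2
    done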

My /boot/loader.conf contains:
ng_bpf_load="YES"
amdtemp_load="YES"
ubsec_load="YES"
vm.kmem_size="1024M"
vm.kmem_size_max="1024M"
vfs.zfs.arc_max="600M"
vfs.zfs.vdev.cache.size="8M"
vfs.zfs.txg.timeout="5"
kern.maxvnodes="250000"
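After boot I can confirm the tunables took effect with sysctl, e.g.:

    sysctl vm.kmem_size vm.kmem_size_max vfs.zfs.arc_max kern.maxvnodes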

This system is a home server, so I can run a debug kernel and crash it again
if required.
My first question is: am I doing something wrong? I think the values I've
set are sufficient, but I could well have got them wrong.
The server also doesn't appear to be writing the crash dump out; it hung at
1% and I had to power-cycle it.
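For reference, crash dumps are configured with the standard knobs, something
like this in /etc/rc.conf (illustrative, not necessarily my exact config):

    dumpdev="AUTO"        # dump to the configured swap device on panic
    dumpdir="/var/crash"  # where savecore extracts the dump at boot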
This is the panic:
panic: solaris assert: 0 == zap_increment_int(os, (-1ULL), user, delta, tx)
(0x0 == 0x7a), file:
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_object.c,
line: 1224
cpuid = 3
KDB: stack backtrace
#0 0xffffffff8055b74e at kdb_backtrace+0x5e
#1 0xffffffff80525c47 at panic+0x187
#2 0xffffffff80e71b9d at do_userquota_update+0xad
#3 0xffffffff80e71dae at dmu_objset_do_userquota_updates+0x1de
#4 0xffffffff80e882af at dsl_pool_sync+0x11f
#5 0xffffffff80e976e4 at spa_sync+0x334
#6 0xffffffff80ea7ed3 at txg_sync_thread+0x253
#7 0xffffffff804f89ee at fork_exit+0x11e
#8 0xffffffff8075847e at fork_trampoline+0xe
Uptime: 14h31m10s
Dumping 2489 out of 16370 MB:..1%

Thanks for any help.
//Clay



