From: "Clayton Milos" <clay@milos.co.za>
To: freebsd-stable@freebsd.org
Date: Mon, 23 Jul 2012 14:08:12 +0100
Subject: ZFS causing panic

Hi guys

I've had an issue for some time now: when I copy a lot of files over to
ZFS, usually via SMB, the machine panics and locks up. I'm running
FreeBSD 9.0-RELEASE with a custom kernel; I've just pulled unnecessary
drivers out of the config and added:

cpu HAMMER
device pf
device pflog
options DEVICE_POLLING
options HZ=1000

For full disclosure, I am getting errors in the syslog which indicate an
ECC error occurring somewhere, and I am still trying to locate it. I have
replaced both CPUs and all of the RAM and am still getting the errors, so
perhaps the northbridge has bought the farm. I don't think this is the
real issue, though, because I was getting these panics on other hardware
before.

Current hardware is an 80GB OS drive, 2x Opteron 285s and 16GB (8x2GB) of
RAM on a Tyan 2892 motherboard. The RAID card is an Areca 1120.

I am running 2 pools, both of them 4-drive hardware RAID5 arrays. The one
I'm having issues with is 4x3TB drives, seen as a single ~9TB SCSI drive:

da0 at arcmsr0 bus 0 scbus6 target 0 lun 0
da0: Fixed Direct Access SCSI-5 device
da0: 166.666MB/s transfers (83.333MHz, offset 32, 16bit)
da0: Command Queueing enabled
da0: 8583068MB (17578123776 512 byte sectors: 255H 63S/T 1094187C)

This drive is encrypted with GELI to produce /dev/da0.eli, on which the
pool is created. It looks like the pool has been lost since the last
panic:

  pool: homer
 state: FAULTED
status: The pool metadata is corrupted and the pool cannot be opened.
action: Destroy and re-create the pool from a backup source.
   see: http://www.sun.com/msg/ZFS-8000-72
  scan: scrub repaired 0 in 7h0m with 0 errors on Mon Jul 23 05:25:27 2012
config:

        NAME       STATE     READ WRITE CKSUM
        homer      FAULTED      0     0     2
          da0.eli  ONLINE       0     0     8

I was also running a script to check the kernel memory every 2 seconds
(roughly the snippet below).
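For reference, the script is essentially the kmem-usage snippet from the
FreeBSD wiki's ZFS tuning page, wrapped in a loop. I'm reproducing it
from memory here, so treat it as approximate:

#!/bin/sh
# Kernel text size (hex Size column of kldstat, summed via dc) plus
# kernel malloc usage (MemUse column of vmstat -m, in KB), every 2s.
while :; do
    TEXT=$(kldstat | awk 'BEGIN {print "16i 0"} NR>1 {print toupper($4) "+"} END {print "p"}' | dc)
    DATA=$(vmstat -m | sed -Ee '1s/.*/0/;s/.* ([0-9]+)K.*/\1+/;$s/$/1024*p/' | dc)
    TOTAL=$((TEXT + DATA))
    echo "TOTAL=$TOTAL, $(echo "scale=3; $TOTAL / 1048576" | bc) MB"
    sleep 2
done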
It appears that it stayed well within the 1G I have assigned it in
/boot/loader.conf:

TOTAL=695217852, 663.011 MB
TOTAL=695217852, 663.011 MB
TOTAL=695217852, 663.011 MB
TOTAL=695219900, 663.013 MB
TOTAL=695219900, 663.013 MB
TOTAL=695345852, 663.133 MB
TOTAL=695412412, 663.197 MB
TOTAL=695228092, 663.021 MB
TOTAL=695228092, 663.021 MB
TOTAL=695226044, 663.019 MB

My /boot/loader.conf contains:

ng_bpf_load="YES"
amdtemp_load="YES"
ubsec_load="YES"
vm.kmem_size="1024M"
vm.kmem_size_max="1024M"
vfs.zfs.arc_max="600M"
vfs.zfs.vdev.cache.size="8M"
vfs.zfs.txg.timeout="5"
kern.maxvnodes="250000"

This system is a home server, so I can run a debug kernel and crash it
again if required. My first question: am I doing something wrong? I
think the values I've put in are sufficient, but I could well have got
them wrong.

By the looks of it, the server is also not writing the crash dump out:
it hung at 1% and I had to power cycle it. This is the panic:

panic: solaris assert: 0 == zap_increment_int(os, (-1ULL), user, delta, tx)
(0x0 == 0x7a), file:
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_object.c,
line: 1224
cpuid = 3
KDB: stack backtrace:
#0 0xffffffff8055b74e at kdb_backtrace+0x5e
#1 0xffffffff80525c47 at panic+0x187
#2 0xffffffff80e71b9d at do_userquota_update+0xad
#3 0xffffffff80e71dae at dmu_objset_do_userquota_updates+0x1de
#4 0xffffffff80e882af at dsl_pool_sync+0x11f
#5 0xffffffff80e976e4 at spa_sync+0x334
#6 0xffffffff80ea7ed3 at txg_sync_thread+0x253
#7 0xffffffff804f89ee at fork_exit+0x11e
#8 0xffffffff8075847e at fork_trampoline+0xe
Uptime: 14h31m10s
Dumping 2489 out of 16370 MB:..1%

Thanks for any help.

//Clay
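P.S. In case it's relevant to the dump hanging at 1%: my dump setup is
just the stock rc.conf knobs, something like the lines below (quoted
from memory, so don't take them as gospel):

dumpdev="AUTO"        # use the first suitable swap device for crash dumps
dumpdir="/var/crash"  # where savecore writes the dump on the next boot

And if a debug kernel would help, I assume the usual debugging options
are what's wanted; happy to rebuild with something like:

options KDB                # kernel debugger framework
options DDB                # interactive kernel debugger
options INVARIANTS         # extra run-time consistency checks
options INVARIANT_SUPPORT  # support code required by INVARIANTS
options WITNESS            # lock order verification
options WITNESS_SKIPSPIN   # skip spin-lock checks to reduce overhead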