Date: Tue, 25 Dec 2012 01:28:00 +0200
From: Andriy Gapon
To: Derek Kulinski
Cc: freebsd-fs@FreeBSD.org, freebsd-stable@FreeBSD.org
Subject: Re: FreeBSD 9.1-RELEASE crashes almost daily; backtraces always list zfs routines
Message-ID: <50D8E500.1070408@FreeBSD.org>
In-Reply-To: <331959998.20121224101719@takeda.tk>
References: <1824023197.20121223142308@takeda.tk> <50D87C56.70709@FreeBSD.org> <331959998.20121224101719@takeda.tk>

on 24/12/2012 20:17 Derek Kulinski said the following:
> Hello Andriy,
>
> Monday, December 24, 2012, 8:01:26 AM, you wrote:
>
>> on 24/12/2012 00:23 Derek Kulinski said the following:
>>> Dumping 3701 out of 8072
>>> MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%
>
>> So do you have the crash dump(s)?
>
> Yes, but they are 3.5GB each. I attached text dump to GNATS but I can
> resend it to you (I don't know if it's ok to send attachments to the
> mailing list). If you would prefer I could give you access to the
> box.

Derek,

I've looked through the cores, and in all cases it looks like some sort of
memory corruption is a precursor to the subsequent crash. I can't say
definitively whether the corruption is caused by the hardware, by some code
overwriting random memory locations (a "rogue" driver), or by a "simpler" bug
like a use after free. I am always inclined to suspect the hardware first.

You can try to reproduce the problem with some additional checks enabled in
the kernel. Those should catch the problem earlier and thus make its source
clearer. I recommend the following:

    options INVARIANTS
    options INVARIANT_SUPPORT
    options WITNESS
    options DEBUG_MEMGUARD
    makeoptions DEBUG+="-DDEBUG"

The last one is really needed only for the ZFS and OpenSolaris compat code. It
may result in some extra noise from unrelated subsystems. Perhaps you could
just add "#define DEBUG" to
sys/cddl/contrib/opensolaris/uts/common/sys/debug.h instead. I haven't tested
this approach though.

Also, please put vm.memguard.desc="arc_buf_hdr_t" into loader.conf.

Please note that these options will make your system significantly slower.

-- 
Andriy Gapon
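
For reference, the options listed above could be collected into a custom
kernel configuration roughly as follows. This is only a sketch under stated
assumptions: the config name DEBUGZFS is hypothetical, and both the amd64
architecture and the use of GENERIC as the base config are guesses rather than
anything confirmed in this thread.

```
# /usr/src/sys/amd64/conf/DEBUGZFS
# (hypothetical file name; amd64 and a GENERIC base are assumptions)
include     GENERIC
ident       DEBUGZFS

# Extra consistency checks recommended in the message above
options     INVARIANTS
options     INVARIANT_SUPPORT
options     WITNESS
options     DEBUG_MEMGUARD

# Needed mainly for the ZFS and OpenSolaris compat code; may add noise elsewhere
makeoptions DEBUG+="-DDEBUG"

# Build and install with the usual targets (run from /usr/src):
#   make buildkernel KERNCONF=DEBUGZFS
#   make installkernel KERNCONF=DEBUGZFS
#
# Then add this line to /boot/loader.conf and reboot:
#   vm.memguard.desc="arc_buf_hdr_t"
```

Keeping the debug options in a separate config that includes the base one makes
it easy to go back to the regular kernel once the problem has been caught,
since the checks slow the system down considerably.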