From: Paul Mather <paul@gromit.dlib.vt.edu>
In-Reply-To: <20120215002351.GB9938@icarus.home.lan>
Date: Wed, 15 Feb 2012 14:21:49 -0500
Message-Id: <274B6964-3CFF-4706-845C-61FA4F8D0617@gromit.dlib.vt.edu>
References: <CB455B5A-0583-4DFB-9712-6FFCC8B67AAB@gromit.dlib.vt.edu>
	<20120215002351.GB9938@icarus.home.lan>
To: Jeremy Chadwick <freebsd@jdc.parodius.com>
Cc: stable@freebsd.org
Subject: Re: ZFS + nullfs + Linuxulator = panic?

On Feb 14, 2012, at 7:23 PM, Jeremy Chadwick wrote:

> On Tue, Feb 14, 2012 at 09:38:18AM -0500, Paul Mather wrote:
>> I have a problem with RELENG_8 (FreeBSD/amd64 running a GENERIC
>> kernel, last built 2012-02-08).  It will panic during the daily
>> periodic scripts that run at 3am.  Here is the most recent panic
>> message:
>>
>> Fatal trap 9: general protection fault while in kernel mode
>> cpuid = 0; apic id = 00
>> instruction pointer     = 0x20:0xffffffff8069d266
>> stack pointer           = 0x28:0xffffff8094b90390
>> frame pointer           = 0x28:0xffffff8094b903a0
>> code segment            = base 0x0, limit 0xfffff, type 0x1b
>>                         = DPL 0, pres 1, long 1, def32 0, gran 1
>> processor eflags        = resume, IOPL = 0
>> current process         = 72566 (ps)
>> trap number             = 9
>> panic: general protection fault
>> cpuid = 0
>> KDB: stack backtrace:
>> #0 0xffffffff8062cf8e at kdb_backtrace+0x5e
>> #1 0xffffffff805facd3 at panic+0x183
>> #2 0xffffffff808e6c20 at trap_fatal+0x290
>> #3 0xffffffff808e715a at trap+0x10a
>> #4 0xffffffff808cec64 at calltrap+0x8
>> #5 0xffffffff805ee034 at fill_kinfo_thread+0x54
>> #6 0xffffffff805eee76 at fill_kinfo_proc+0x586
>> #7 0xffffffff805f22b8 at sysctl_out_proc+0x48
>> #8 0xffffffff805f26c8 at sysctl_kern_proc+0x278
>> #9 0xffffffff8060473f at sysctl_root+0x14f
>> #10 0xffffffff80604a2a at userland_sysctl+0x14a
>> #11 0xffffffff80604f1a at __sysctl+0xaa
>> #12 0xffffffff808e62d4 at amd64_syscall+0x1f4
>> #13 0xffffffff808cef5c at Xfast_syscall+0xfc
>> Uptime: 3d19h6m0s
>> Dumping 1308 out of 2028 MB:..2%..12%..21%..31%..41%..51%..62%..71%..81%..91%
>> Dump complete
>> Automatic reboot in 15 seconds - press a key on the console to abort
>> Rebooting...
>>
>>
>> The reason for the subject line is that I have another RELENG_8
>> system that uses ZFS + nullfs but doesn't panic, leading me to
>> believe that ZFS + nullfs is not the problem.  I am wondering if it
>> is the combination of the three that is deadly, here.
>>
>> Both RELENG_8 systems are root-on-ZFS installs.  Each night there is
>> a separate backup script that runs and completes before the regular
>> "periodic daily" run.  This script takes a recursive snapshot of the
>> ZFS pool and then mounts these snapshots via mount_nullfs to provide
>> a coherent view of the filesystem under /backup.  The only difference
>> between the two RELENG_8 systems is that one uses rsync to back up
>> /backup to another machine and the other uses the Linux Tivoli TSM
>> client to back up /backup to a TSM server.  After the backup is
>> completed, a script runs that unmounts the nullfs file systems and
>> then destroys the ZFS snapshot.
>>
>> The first (rsync backup) RELENG_8 system does not panic.  It has been
>> running the ZFS + nullfs rsync backup job without incident for weeks
>> now.  The second (Tivoli TSM) RELENG_8 will reliably panic when the
>> subsequent "periodic daily" job runs.  (It is using the 32-bit TSM
>> 6.2.4 Linux client running "dsmc schedule" via the
>> linux_base-f10-10_4 package.)  The actual ZFS + nullfs Tivoli TSM
>> backup job appears to run successfully, making me wonder if perhaps
>> it has some memory leak or other subtle corruption that sets up the
>> ensuing panic when the "periodic daily" job later gives the system a
>> workout.
>>
>> If I can provide more information about the panic, please let me
>> know.  Despite the message about dumping in the panic output above,
>> when the system reboots I get a "No core dumps found" message during
>> boot.  (I have dumpdev="AUTO" set in /etc/rc.conf.)  My swap device
>> is on separate partitions but is mirrored using geom_mirror as
>> /dev/mirror/swap.  Do crash dumps to gmirror devices work on
>> RELENG_8?
>
> See gmirror(8) man page, section NOTES.  Read the full thing.


Thanks!  I've changed the gmirror balance algorithm on the swap mirror
to "prefer", so hopefully I'll get saved crash dumps to examine from
now on.
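
For the archives, here is a rough sketch of that change, in case it
helps anyone else with swap on gmirror.  It assumes the mirror is named
"swap" (going by /dev/mirror/swap).  The run wrapper and the DRY_RUN
variable are only illustrative scaffolding, so the commands can be
previewed before touching a live mirror.

```sh
#!/bin/sh
# Sketch only. MIRROR is assumed from /dev/mirror/swap; adjust as needed.
MIRROR="${MIRROR:-swap}"
# DRY_RUN defaults to 1 (print only); set DRY_RUN=0 to execute for real.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" != "0" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Switch the mirror to the "prefer" balance algorithm, so that reads go
# to the highest-priority component.
set_prefer() {
    run gmirror configure -b prefer "$MIRROR"
}

# Show the mirror's current settings so the balance can be checked.
show_mirror() {
    run gmirror list "$MIRROR"
}

set_prefer
show_mirror
```

Whether savecore then picks up the dump at boot is something the next
panic will have to confirm.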


>> Does anyone have any idea what is to blame for the panic, or how I
>> can fix or work around it?
>
> Does the panic always happen when "ps" is run?  That's what's shown in
> the above panic message.  Quoting:
>
>> current process         = 72566 (ps)
>
> And I'm inclined to think it does, based on the backtrace:
>
>> #5 0xffffffff805ee034 at fill_kinfo_thread+0x54
>> #6 0xffffffff805eee76 at fill_kinfo_proc+0x586
>> #7 0xffffffff805f22b8 at sysctl_out_proc+0x48
>> #8 0xffffffff805f26c8 at sysctl_kern_proc+0x278
>
> But if you can go through the previous panics and confirm that, it would
> be helpful to developers in tracking down the problem.


Going by memory, it panicked during "df" at least once, but most of the
time I remember the panic occurring during "ps".

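For context, the nightly job described in the quoted message can be
sketched roughly as follows.  This is an illustrative outline only, not
the script actually in use: the pool name "tank", the snapshot name
"nightly", the example mountpoints and the dry-run wrapper are all
placeholders.

```sh
#!/bin/sh
# Illustrative sketch, not the production script.  POOL, SNAP and the
# mountpoints used below are placeholder names.
POOL="${POOL:-tank}"
SNAP="${SNAP:-nightly}"
BACKUP="${BACKUP:-/backup}"
# DRY_RUN defaults to 1 (print only); set DRY_RUN=0 to execute for real.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" != "0" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Take one recursive snapshot of the whole pool.
take_snapshot() {
    run zfs snapshot -r "${POOL}@${SNAP}"
}

# Expose one dataset's snapshot read-only under $BACKUP.
# $1 is the dataset's mountpoint, e.g. / or /usr/home.
mount_snapshot() {
    run mount_nullfs -o ro "${1%/}/.zfs/snapshot/${SNAP}" "${BACKUP}${1%/}"
}

# Remove one nullfs mount again.
unmount_snapshot() {
    run umount "${BACKUP}${1%/}"
}

# Drop the recursive snapshot once the backup has finished.
destroy_snapshot() {
    run zfs destroy -r "${POOL}@${SNAP}"
}

take_snapshot
for mp in / /usr/home; do mount_snapshot "$mp"; done
# ... the backup client (rsync or dsmc) reads from $BACKUP here ...
for mp in /usr/home /; do unmount_snapshot "$mp"; done
destroy_snapshot
```

The mounts are torn down in reverse order so that nested mountpoints
are released before their parents.
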
Cheers,

Paul.