From: Paul Mather
To: Jeremy Chadwick
Cc: stable@freebsd.org
Subject: Re: ZFS + nullfs + Linuxulator = panic?
Date: Wed, 15 Feb 2012 14:21:49 -0500
Message-Id: <274B6964-3CFF-4706-845C-61FA4F8D0617@gromit.dlib.vt.edu>
In-Reply-To: <20120215002351.GB9938@icarus.home.lan>
List-Id: Production branch of FreeBSD source code

On Feb 14, 2012, at 7:23 PM, Jeremy Chadwick wrote:

> On Tue, Feb 14, 2012 at 09:38:18AM -0500, Paul Mather wrote:
>> I have a problem with RELENG_8 (FreeBSD/amd64 running a GENERIC
>> kernel, last built 2012-02-08).  It will panic during the daily
>> periodic scripts that run at 3am.  Here is the most recent panic
>> message:
>>
>> Fatal trap 9: general protection fault while in kernel mode
>> cpuid = 0; apic id = 00
>> instruction pointer = 0x20:0xffffffff8069d266
>> stack pointer = 0x28:0xffffff8094b90390
>> frame pointer = 0x28:0xffffff8094b903a0
>> code segment = base 0x0, limit 0xfffff, type 0x1b
>>              = DPL 0, pres 1, long 1, def32 0, gran 1
>> processor eflags = resume, IOPL = 0
>> current process = 72566 (ps)
>> trap number = 9
>> panic: general protection fault
>> cpuid = 0
>> KDB: stack backtrace:
>> #0 0xffffffff8062cf8e at kdb_backtrace+0x5e
>> #1 0xffffffff805facd3 at panic+0x183
>> #2 0xffffffff808e6c20 at trap_fatal+0x290
>> #3 0xffffffff808e715a at trap+0x10a
>> #4 0xffffffff808cec64 at calltrap+0x8
>> #5 0xffffffff805ee034 at fill_kinfo_thread+0x54
>> #6 0xffffffff805eee76 at fill_kinfo_proc+0x586
>> #7 0xffffffff805f22b8 at sysctl_out_proc+0x48
>> #8 0xffffffff805f26c8 at sysctl_kern_proc+0x278
>> #9 0xffffffff8060473f at sysctl_root+0x14f
>> #10 0xffffffff80604a2a at userland_sysctl+0x14a
>> #11 0xffffffff80604f1a at __sysctl+0xaa
>> #12 0xffffffff808e62d4 at amd64_syscall+0x1f4
>> #13 0xffffffff808cef5c at Xfast_syscall+0xfc
>> Uptime: 3d19h6m0s
>> Dumping 1308 out of 2028 MB:..2%..12%..21%..31%..41%..51%..62%..71%..81%..91%
>> Dump complete
>> Automatic reboot in 15 seconds - press a key on the console to abort
>> Rebooting...
>>
>> The reason for the subject line is that I have another RELENG_8
>> system that uses ZFS + nullfs but doesn't panic, leading me to
>> believe that ZFS + nullfs is not the problem.  I am wondering if it
>> is the combination of the three that is deadly, here.
>>
>> Both RELENG_8 systems are root-on-ZFS installs.  Each night there is
>> a separate backup script that runs and completes before the regular
>> "periodic daily" run.  This script takes a recursive snapshot of the
>> ZFS pool and then mounts these snapshots via mount_nullfs to provide
>> a coherent view of the filesystem under /backup.  The only difference
>> between the two RELENG_8 systems is that one uses rsync to back up
>> /backup to another machine and the other uses the Linux Tivoli TSM
>> client to back up /backup to a TSM server.  After the backup is
>> completed, a script runs that unmounts the nullfs file systems and
>> then destroys the ZFS snapshot.
>>
>> The first (rsync backup) RELENG_8 system does not panic.  It has been
>> running the ZFS + nullfs rsync backup job without incident for weeks
>> now.  The second (Tivoli TSM) RELENG_8 will reliably panic when the
>> subsequent "periodic daily" job runs.  (It is using the 32-bit TSM
>> 6.2.4 Linux client running "dsmc schedule" via the
>> linux_base-f10-10_4 package.)  The actual ZFS + nullfs Tivoli TSM
>> backup job appears to run successfully, making me wonder if perhaps
>> it has some memory leak or other subtle corruption that sets up the
>> ensuing panic when the "periodic daily" job later gives the system a
>> workout.
>>
>> If I can provide more information about the panic, please let me
>> know.  Despite the message about dumping in the panic output above,
>> when the system reboots I get a "No core dumps found" message during
>> boot.  (I have dumpdev="AUTO" set in /etc/rc.conf.)  My swap device
>> is on separate partitions but is mirrored using geom_mirror as
>> /dev/mirror/swap.  Do crash dumps to gmirror devices work on
>> RELENG_8?
>
> See gmirror(8) man page, section NOTES.  Read the full thing.

Thanks!  I've changed the balance algorithm to "prefer", so hopefully
I'll get saved crash dumps to examine from now on.

>> Does anyone have any idea what is to blame for the panic, or how I
>> can fix or work around it?
>
> Does the panic always happen when "ps" is run?  That's what's shown in
> the above panic message.  Quoting:
>
>> current process = 72566 (ps)
>
> And I'm inclined to think it does, based on the backtrace:
>
>> #5 0xffffffff805ee034 at fill_kinfo_thread+0x54
>> #6 0xffffffff805eee76 at fill_kinfo_proc+0x586
>> #7 0xffffffff805f22b8 at sysctl_out_proc+0x48
>> #8 0xffffffff805f26c8 at sysctl_kern_proc+0x278
>
> But if you can go through the previous panics and confirm that, it
> would be helpful to developers in tracking down the problem.

Going by memory, it panicked during "df" at least one other time, but
most of the time I remember the panic occurring during "ps".

Cheers,

Paul.
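
For anyone who wants to check earlier panics against the "ps" versus
"df" question rather than rely on memory, here is a minimal Python
sketch.  It assumes the panic messages have been saved as plain text
somewhere (the thread does not say where, or whether they were kept at
all), and the function name is made up for illustration.  It only
looks for the "current process = PID (name)" line shown in the panic
above and counts the process names.

```python
import re
from collections import Counter

# Matches lines such as "current process = 72566 (ps)".  The optional "3D"
# tolerates quoted-printable residue ("=3D") found in archived copies.
CURRENT_PROCESS = re.compile(
    r"current process\s*=(?:3D)?\s*(\d+)\s*\(([^)]*)\)"
)


def tally_panic_processes(text):
    """Count how often each process name appears as the 'current process'."""
    return Counter(name for _pid, name in CURRENT_PROCESS.findall(text))


if __name__ == "__main__":
    # Hypothetical input: one line copied from the panic message above.
    sample = "current process = 72566 (ps)\ntrap number = 9\n"
    for name, count in tally_panic_processes(sample).most_common():
        print(name, count)
```

Feeding it the concatenated text of several saved panics would give a
per-command count.  That count only says which process was running at
the time of each trap, not what caused the fault.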