From: John Baldwin
To: Andre Albsmeier
Cc: "freebsd-stable@freebsd.org"
Subject: Re: FreeBSD-9.1: machine reboots during snapshot creation, LORs found
Date: Mon, 17 Jun 2013 15:30:31 -0400

On Sunday, June 16, 2013 2:39:42 am Andre Albsmeier wrote:
> On Fri, 31-May-2013 at 16:51:03 +0200, John Baldwin wrote:
> > On Friday, May 31, 2013 8:26:11 am Andre Albsmeier wrote:
> > > Each day at 5:15 we are generating snapshots on various machines.
> > > This used to work perfectly under 7-STABLE for years but since
> > > we started to use 9.1-STABLE the machine reboots in about 10%
> > > of all cases.
> > >
> > > After rebooting we find a new snapshot file which is a bit
> > > smaller than the good ones and with different permissions.
> > > It does not succeed a fsck. In this example it is the one
> > > whose name is beginning with s3:
> > >
> > > -r--r----- 1 root operator snapshot 72802894528 29 May 05:15 s2-2013.05.28-03.15.04
> > > -r-------- 1 root operator snapshot 72802893824 29 May 05:15 s3-2013.05.29-03.15.03
> > > -r--r----- 1 root operator snapshot 72802894528 28 May 14:22 s4-2013.05.23-06.38.44
> > > -r--r----- 1 root operator snapshot 72802894528 28 May 14:22 s5-2013.05.24-03.15.03
> > > -r--r----- 1 root operator snapshot 72802894528 28 May 14:22 s6-2013.05.25-03.15.03
> > >
> > > After enabling DIAGNOSTIC, WITNESS and INVARIANTS in the kernel
> > > I see the following LORs (mksnap_ffs starts exactly at 5:15):
> > >
> > > May 29 05:15:00 palveli kernel: lock order reversal:
> > > May 29 05:15:00 palveli kernel: 1st 0xc2371da8 ufs (ufs) @ /src/src-9/sys/kern/vfs_mount.c:1240
> > > May 29 05:15:00 palveli kernel: 2nd 0xc2371ec4 devfs (devfs) @ /src/src-9/sys/ufs/ffs/ffs_vfsops.c:1414
> > > May 29 05:15:04 palveli kernel: lock order reversal:
> > > May 29 05:15:04 palveli kernel: 1st 0xc228471c snaplk (snaplk) @ /src/src-9/sys/ufs/ufs/ufs_vnops.c:976
> > > May 29 05:15:04 palveli kernel: 2nd 0xc22f25e4 ufs (ufs) @ /src/src-9/sys/ufs/ffs/ffs_snapshot.c:1626
> > >
> > > Unfortunately no corefiles are being generated ;-(.
> > >
> > > I have checked and even rebuilt the (UFS1) fs in question
> > > from scratch. I have also seen this happen on an UFS2 on
> > > another machine and on a third one when running "dump -L"
> > > on a root fs.
> > >
> > > Any hints of how to proceed?
> >
> > Would it be possible to set up a serial console that is logged on this machine
> > to see if it is panic'ing but failing to write out a crashdump?
>
> Couldn't attach the serial console yet ;-(. But I had people
> attach a KVMoverIP switch and enabled the various KDB options
> in the kernel. Now we can see a bit more (see below) -- no
> crashdump is being generated though.

:(  Unfortunately these LORs don't really help with discerning the cause of the
reboot.  If you have remote power access (and still want to test this), one
option would be to change KDB to drop into the debugger on a panic.  Then you
could connect over the KVM and take images of the original panic along with a
stack trace.

-- 
John Baldwin