From: John Baldwin <jhb@freebsd.org>
To: Andre Albsmeier <Andre.Albsmeier@siemens.com>
Subject: Re: FreeBSD-9.1: machine reboots during snapshot creation, LORs found
Date: Mon, 17 Jun 2013 15:30:31 -0400
User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110714-p25; KDE/4.5.5; amd64; ; )
References: <20130531122611.GA6607@bali> <201305311051.03157.jhb@freebsd.org>
 <20130616063942.GA72803@bali>
In-Reply-To: <20130616063942.GA72803@bali>
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Message-Id: <201306171530.31208.jhb@freebsd.org>
Cc: "freebsd-stable@freebsd.org" <freebsd-stable@freebsd.org>

On Sunday, June 16, 2013 2:39:42 am Andre Albsmeier wrote:
> On Fri, 31-May-2013 at 16:51:03 +0200, John Baldwin wrote:
> > On Friday, May 31, 2013 8:26:11 am Andre Albsmeier wrote:
> > > Each day at 5:15 we are generating snapshots on various machines.
> > > This used to work perfectly under 7-STABLE for years but since
> > > we started to use 9.1-STABLE the machine reboots in about 10%
> > > of all cases.
> > > 
> > > After rebooting we find a new snapshot file which is a bit
> > > smaller than the good ones and has different permissions.
> > > It does not pass fsck. In this example it is the one
> > > whose name begins with s3:
> > > 
> > > -r--r-----   1 root  operator  snapshot 72802894528 29 May 05:15 s2-2013.05.28-03.15.04
> > > -r--------   1 root  operator  snapshot 72802893824 29 May 05:15 s3-2013.05.29-03.15.03
> > > -r--r-----   1 root  operator  snapshot 72802894528 28 May 14:22 s4-2013.05.23-06.38.44
> > > -r--r-----   1 root  operator  snapshot 72802894528 28 May 14:22 s5-2013.05.24-03.15.03
> > > -r--r-----   1 root  operator  snapshot 72802894528 28 May 14:22 s6-2013.05.25-03.15.03
> > > 
> > > After enabling DIAGNOSTIC, WITNESS and INVARIANTS in the kernel
> > > I see the following LORs (mksnap_ffs starts exactly at 5:15):
> > > 
> > > May 29 05:15:00 <kern.crit> palveli kernel: lock order reversal:
> > > May 29 05:15:00 <kern.crit> palveli kernel: 1st 0xc2371da8 ufs (ufs) @ /src/src-9/sys/kern/vfs_mount.c:1240
> > > May 29 05:15:00 <kern.crit> palveli kernel: 2nd 0xc2371ec4 devfs (devfs) @ /src/src-9/sys/ufs/ffs/ffs_vfsops.c:1414
> > > May 29 05:15:04 <kern.crit> palveli kernel: lock order reversal:
> > > May 29 05:15:04 <kern.crit> palveli kernel: 1st 0xc228471c snaplk (snaplk) @ /src/src-9/sys/ufs/ufs/ufs_vnops.c:976
> > > May 29 05:15:04 <kern.crit> palveli kernel: 2nd 0xc22f25e4 ufs (ufs) @ /src/src-9/sys/ufs/ffs/ffs_snapshot.c:1626
> > > 
> > > Unfortunately no core files are being generated ;-(.
> > > 
> > > I have checked and even rebuilt the (UFS1) fs in question
> > > from scratch. I have also seen this happen on a UFS2 fs on
> > > another machine and on a third one when running "dump -L"
> > > on a root fs.
> > > 
> > > Any hints on how to proceed?
> > 
> > Would it be possible to set up a logged serial console on this machine
> > to see whether it is panicking but failing to write out a crashdump?
> 
> Couldn't attach the serial console yet ;-(. But I had people
> attach a KVM-over-IP switch and enabled the various KDB options
> in the kernel. Now we can see a bit more (see below) -- no
> crashdump is being generated though.

:(  Unfortunately these LORs don't really help with discerning the cause of
the reboot.  If you have remote power access (and still want to test this),
one option would be to configure KDB so that the machine drops into the
debugger on a panic instead of rebooting.  Then you could connect over the
KVM and take screenshots of the original panic message along with a stack
trace.
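
For reference, a rough sketch of what that could look like.  This assumes
the kernel is built with KDB and DDB; I don't know exactly which options are
in your config, so treat the fragment as an illustration rather than a
drop-in change:

```
# Kernel config fragment (assumed; adapt to the existing config)
options 	KDB		# kernel debugger framework
options 	DDB		# interactive in-kernel debugger
# Leave out "options KDB_UNATTENDED": with it set, a panic skips the
# debugger and goes straight to dump/reboot.

# On a running system the same behavior is controlled by a sysctl:
#   sysctl debug.debugger_on_panic=1
```

Once the machine is sitting at the db> prompt, "bt" should print the stack
trace of the panicking thread, which is the part worth capturing over the KVM.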

-- 
John Baldwin