Date: Tue, 18 Jun 2013 07:27:58 +0200
From: Andre Albsmeier
To: John Baldwin
Subject: Re: FreeBSD-9.1: machine reboots during snapshot creation, LORs found
Message-ID: <20130618052758.GA1467@bali>
References: <20130531122611.GA6607@bali> <201305311051.03157.jhb@freebsd.org> <20130616063942.GA72803@bali> <201306171530.31208.jhb@freebsd.org>
In-Reply-To: <201306171530.31208.jhb@freebsd.org>
Cc: "freebsd-stable@freebsd.org"

On Mon, 17-Jun-2013 at 21:30:31 +0200, John Baldwin wrote:
> On Sunday, June 16, 2013 2:39:42 am Andre Albsmeier wrote:
> > On Fri, 31-May-2013 at 16:51:03 +0200, John Baldwin wrote:
> > > On Friday, May 31, 2013 8:26:11 am Andre Albsmeier wrote:
> > > > Every day at 5:15 we generate snapshots on various machines.
> > > > This worked perfectly under 7-STABLE for years, but since
> > > > we started using 9.1-STABLE the machine reboots in about 10%
> > > > of all cases.
> > > >
> > > > After the reboot we find a new snapshot file which is slightly
> > > > smaller than the good ones and has different permissions.
> > > > It does not pass fsck. In this example it is the one
> > > > whose name begins with s3:
> > > >
> > > > -r--r----- 1 root operator snapshot 72802894528 29 May 05:15 s2-2013.05.28-03.15.04
> > > > -r-------- 1 root operator snapshot 72802893824 29 May 05:15 s3-2013.05.29-03.15.03
> > > > -r--r----- 1 root operator snapshot 72802894528 28 May 14:22 s4-2013.05.23-06.38.44
> > > > -r--r----- 1 root operator snapshot 72802894528 28 May 14:22 s5-2013.05.24-03.15.03
> > > > -r--r----- 1 root operator snapshot 72802894528 28 May 14:22 s6-2013.05.25-03.15.03
> > > >
> > > > After enabling DIAGNOSTIC, WITNESS and INVARIANTS in the kernel
> > > > I see the following LORs (mksnap_ffs starts exactly at 5:15):
> > > >
> > > > May 29 05:15:00 palveli kernel: lock order reversal:
> > > > May 29 05:15:00 palveli kernel: 1st 0xc2371da8 ufs (ufs) @ /src/src-9/sys/kern/vfs_mount.c:1240
> > > > May 29 05:15:00 palveli kernel: 2nd 0xc2371ec4 devfs (devfs) @ /src/src-9/sys/ufs/ffs/ffs_vfsops.c:1414
> > > > May 29 05:15:04 palveli kernel: lock order reversal:
> > > > May 29 05:15:04 palveli kernel: 1st 0xc228471c snaplk (snaplk) @ /src/src-9/sys/ufs/ufs/ufs_vnops.c:976
> > > > May 29 05:15:04 palveli kernel: 2nd 0xc22f25e4 ufs (ufs) @ /src/src-9/sys/ufs/ffs/ffs_snapshot.c:1626
> > > >
> > > > Unfortunately no core files are being generated ;-(.
> > > >
> > > > I have checked and even rebuilt the (UFS1) fs in question
> > > > from scratch. I have also seen this happen on a UFS2 fs on
> > > > another machine, and on a third one when running "dump -L"
> > > > on a root fs.
> > > >
> > > > Any hints on how to proceed?
> > >
> > > Would it be possible to set up a serial console that is logged on this machine
> > > to see if it is panic'ing but failing to write out a crashdump?
> >
> > I couldn't attach the serial console yet ;-(. But I had people
> > attach a KVM-over-IP switch, and I enabled the various KDB options
> > in the kernel. Now we can see a bit more (see below); no
> > crashdump is being generated, though.
>
> :( Unfortunately these LORs don't really help with discerning the cause of
> the reboot. If you have remote power access (and still want to test this),
> one option would be to change KDB to drop into the debugger on a panic.
> Then you could connect over the KVM and take images of the original panic
> along with a stack trace.

As described yesterday, I think I know why we don't get dumps:
I dump to da1, and da1 is spun down. On FreeBSD-7, da1 spun up
automatically in this case; on FreeBSD-9 it doesn't. I now dump
to da0, which is already running...

I will now try to get a dump; however, I have to arrange this
with the people using the machine. I'll get back to you when
I have a dump ready...

Thanks,
-Andre

>
> --
> John Baldwin

--
Win98: useless extension to a minor patch release for 32-bit extensions
and a graphical shell for a 16-bit patch to an 8-bit operating system
originally coded for a 4-bit microprocessor, written by a 2-bit company
that can't stand for 1 bit of competition.