From owner-freebsd-stable@FreeBSD.ORG Thu Jul 4 14:29:22 2013 Return-Path: Delivered-To: freebsd-stable@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 228262E4; Thu, 4 Jul 2013 14:29:22 +0000 (UTC) (envelope-from Andre.Albsmeier@siemens.com) Received: from thoth.sbs.de (thoth.sbs.de [192.35.17.2]) by mx1.freebsd.org (Postfix) with ESMTP id A21F018F5; Thu, 4 Jul 2013 14:29:21 +0000 (UTC) Received: from mail1.siemens.de (localhost [127.0.0.1]) by thoth.sbs.de (8.13.6/8.13.6) with ESMTP id r64ETJ41008408; Thu, 4 Jul 2013 16:29:19 +0200 Received: from curry.mchp.siemens.de (curry.mchp.siemens.de [139.25.40.130]) by mail1.siemens.de (8.13.6/8.13.6) with ESMTP id r64ETJD8011971; Thu, 4 Jul 2013 16:29:19 +0200 Received: (from localhost) by curry.mchp.siemens.de (8.14.7/8.14.7) id r64ETJHD016405; Date: Thu, 4 Jul 2013 16:29:19 +0200 From: Andre Albsmeier To: Konstantin Belousov Subject: Re: FreeBSD-9.1: machine reboots during snapshot creation, LORs found Message-ID: <20130704142919.GA1798@bali> References: <20130531122611.GA6607@bali> <201305311051.03157.jhb@freebsd.org> <20130616063942.GA72803@bali> <201306171530.31208.jhb@freebsd.org> <20130704051409.GA22021@bali> <20130704052440.GG91021@kib.kiev.ua> <20130704052659.GA23398@bali> <20130704061550.GI91021@kib.kiev.ua> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130704061550.GI91021@kib.kiev.ua> X-Echelon: X-Advice: Drop that crappy M$-Outlook, I'm tired of your viruses! User-Agent: Mutt/1.5.21 (2010-09-15) Cc: "freebsd-stable@freebsd.org" , John Baldwin X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 04 Jul 2013 14:29:22 -0000 On Thu, 04-Jul-2013 at 08:15:50 +0200, Konstantin Belousov wrote: > On Thu, Jul 04, 2013 at 07:27:00AM +0200, Andre Albsmeier wrote: > > On Thu, 04-Jul-2013 at 07:24:40 +0200, Konstantin Belousov wrote: > > > On Thu, Jul 04, 2013 at 07:14:09AM +0200, Andre Albsmeier wrote: > > > > On Mon, 17-Jun-2013 at 21:30:31 +0200, John Baldwin wrote: > > > > > On Sunday, June 16, 2013 2:39:42 am Andre Albsmeier wrote: > > > > > > On Fri, 31-May-2013 at 16:51:03 +0200, John Baldwin wrote: > > > > > > > On Friday, May 31, 2013 8:26:11 am Andre Albsmeier wrote: > > > > > > > > Each day at 5:15 we are generating snapshots on various machines. > > > > > > > > This used to work perfectly under 7-STABLE for years but since > > > > > > > > we started to use 9.1-STABLE the machine reboots in about 10% > > > > > > > > of all cases. > > > > > > > > > > > > > > > > After rebooting we find a new snapshot file which is a bit > > > > > > > > smaller than the good ones and with different permissions > > > > > > > > It does not succeed a fsck. In this example it is the one > > > > > > > > whose name is beginning with s3: > > > > > > > > > > > > > > > > -r--r----- 1 root operator snapshot 72802894528 29 May 05:15 s2-2013.05.28-03.15.04 > > > > > > > > -r-------- 1 root operator snapshot 72802893824 29 May 05:15 s3-2013.05.29-03.15.03 > > > > > > > > -r--r----- 1 root operator snapshot 72802894528 28 May 14:22 s4-2013.05.23-06.38.44 > > > > > > > > -r--r----- 1 root operator snapshot 72802894528 28 May 14:22 s5-2013.05.24-03.15.03 > > > > > > > > -r--r----- 1 root operator snapshot 72802894528 28 May 14:22 s6-2013.05.25-03.15.03 > > > > > > > > > > > > > > > > After enabling DIAGNOSTIC, WITNESS and INVARIANTS in the kernel > > > > > > > > I see the following LORs (mksnap_ffs starts exactly at 5:15): > > > > > > > > > > > > > > > > May 29 05:15:00 palveli kernel: lock order reversal: > > > > > > > > May 29 05:15:00 palveli kernel: 1st 0xc2371da8 ufs (ufs) @ /src/src-9/sys/kern/vfs_mount.c:1240 > > > > > > > > May 29 05:15:00 palveli kernel: 2nd 0xc2371ec4 devfs (devfs) @ /src/src-9/sys/ufs/ffs/ffs_vfsops.c:1414 > > > > > > > > May 29 05:15:04 palveli kernel: lock order reversal: > > > > > > > > May 29 05:15:04 palveli kernel: 1st 0xc228471c snaplk (snaplk) @ /src/src-9/sys/ufs/ufs/ufs_vnops.c:976 > > > > > > > > May 29 05:15:04 palveli kernel: 2nd 0xc22f25e4 ufs (ufs) @ /src/src-9/sys/ufs/ffs/ffs_snapshot.c:1626 > > > > > > > > > > > > > > > > Unfortunatley no corefiles are being generated ;-(. > > > > > > > > > > > > > > > > I have checked and even rebuilt the (UFS1) fs in question > > > > > > > > from scratch. I have also seen this happen on an UFS2 on > > > > > > > > another machine and on a third one when running "dump -L" > > > > > > > > on a root fs. > > > > > > > > > > > > > > > > Any hints of how to proceed? > > > > > > > > > > > > > > Would it be possible to setup a serial console that is logged on this machine > > > > > > > to see if it is panic'ing but failing to write out a crashdump? > > > > > > > > > > > > Couldn't attach the serial console yet ;-(. But I had people > > > > > > attach a KVMoverIP switch and enabled the various KDB options > > > > > > in the kernel. Now we can see a bit more (see below) -- no > > > > > > crashdump is being generated though. > > > > > > > > > > :( Unfortunately these LORs don't really help with discerning the cause of > > > > > the reboot. If you have remote power access (and still wanted to test this) > > > > > one option would be to change KDB to drop into the debugger on a panic. > > > > > Then you could connect over the KVM and take images of the original panic > > > > > along with a stack trace. > > > > > > > > After a few days of no problems, the box decided to crash > > > > during mksnap_ffs today ;-(. But now I have a crashdump, > > > > see below. Unfortunatley, I cannot upload the dump somewhere > > > > but if you ask me check whatever things I'll be happy to help. > > > > > > > > kgdb /usr/obj/src/src-9/sys/palveli/kernel.debug vmcore.4 > > > > GNU gdb 6.1.1 [FreeBSD] > > > > Copyright 2004 Free Software Foundation, Inc. > > > > GDB is free software, covered by the GNU General Public License, and you are > > > > welcome to change it and/or distribute copies of it under certain conditions. > > > > Type "show copying" to see the conditions. > > > > There is absolutely no warranty for GDB. Type "show warranty" for details. > > > > This GDB was configured as "i386-marcel-freebsd"... > > > > > > > > Unread portion of the kernel message buffer: > > > > > > > > > > > > Fatal trap 12: page fault while in kernel mode > > > > fault virtual address = 0xcfb5e000 > > > > fault code = supervisor write, page not present > > > > instruction pointer = 0x20:0xc07cb2fe > > > > stack pointer = 0x28:0xd83545d0 > > > > frame pointer = 0x28:0xd835490c > > > > code segment = base 0x0, limit 0xfffff, type 0x1b > > > > = DPL 0, pres 1, def32 1, gran 1 > > > > processor eflags = interrupt enabled, resume, IOPL = 0 > > > > current process = 12929 (mksnap_ffs) > > > > trap number = 12 > > > > panic: page fault > > > > KDB: stack backtrace: > > > > db_trace_self_wrapper(c08207eb,d835441c,c05fdfc9,c081df13,c08a82e0,...) at db_trace_self_wrapper+0x26/frame 0xd83543ec > > > > kdb_backtrace(c081df13,c08a82e0,c0801bfa,d8354428,d8354428,...) at kdb_backtrace+0x29/frame 0xd83543f8 > > > > panic(c0801bfa,c0845a01,c2bafae4,1,1,...) at panic+0xc9/frame 0xd835441c > > > > trap_fatal(c0ff6000,cfb5e000,2,0,265abf,...) at trap_fatal+0x353/frame 0xd835445c > > > > trap_pfault(140da,0,c2baf930,c08b6a40,c282145c,...) at trap_pfault+0x2d7/frame 0xd83544a4 > > > > trap(d8354590) at trap+0x41a/frame 0xd8354584 > > > > calltrap() at calltrap+0x6/frame 0xd8354584 > > > > --- trap 0xc, eip = 0xc07cb2fe, esp = 0xd83545d0, ebp = 0xd835490c --- > > > > bcopy(c2b36548,c2f194e0,0,0,0,...) at bcopy+0x1a/frame 0xd835490c > > > > ffs_mount(c2b36548,c2db9000,ff,d8354c08,c2b665e4,...) at ffs_mount+0x15ee/frame 0xd8354a3c > > > > > > From the crash dump in kgdb, do > > > list *ffs_mount+0x15ee > > > > (kgdb) list *ffs_mount+0x15ee > > 0xc0748e8e is in ffs_mount (/src/src-9/sys/ufs/ffs/ffs_vfsops.c:483). > > 478 > > 479 /* > > 480 * If this is a snapshot request, take the snapshot. > > 481 */ > > 482 if (mp->mnt_flag & MNT_SNAPSHOT) > > 483 return (ffs_snapshot(mp, fspec)); > > 484 } > > 485 > > 486 /* > > 487 * Not an update, or updating the name: look up the name > > It is not useful, bcopy does not create a frame, so the real caller > of the failing bcopy gets lost. It could be uncovered with some > stack digging, but I believe it would be easier just fix bcopy. > > Please apply this patch and reproduce the panic again, then the kgdb > backtrace should be more useful. OK, patch is applied. I will reboot the machine later and see what happens tomorrow in the morning. However, it might take a few days since the last 2 weeks all was fine. BTW, should this patch be used in general or is it just for debugging? My understanding is that it is something which could stay in the code... Thanks, -Andre > > diff --git a/sys/i386/i386/support.s b/sys/i386/i386/support.s > index 967a09e..779fa38 100644 > --- a/sys/i386/i386/support.s > +++ b/sys/i386/i386/support.s > @@ -181,11 +181,13 @@ END(bcopyb) > * ws@tools.de (Wolfgang Solfrank, TooLs GmbH) +49-228-985800 > */ > ENTRY(bcopy) > + pushl %ebp > + movl %esp,%ebp > pushl %esi > pushl %edi > - movl 12(%esp),%esi > - movl 16(%esp),%edi > - movl 20(%esp),%ecx > + movl 8(%ebp),%esi > + movl 12(%ebp),%edi > + movl 16(%ebp),%ecx > > movl %edi,%eax > subl %esi,%eax > @@ -196,12 +198,13 @@ ENTRY(bcopy) > cld /* nope, copy forwards */ > rep > movsl > - movl 20(%esp),%ecx > + movl 16(%ebp),%ecx > andl $3,%ecx /* any bytes left? */ > rep > movsb > popl %edi > popl %esi > + popl %ebp > ret > > ALIGN_TEXT > @@ -214,7 +217,7 @@ ENTRY(bcopy) > std > rep > movsb > - movl 20(%esp),%ecx /* copy remainder by 32-bit words */ > + movl 16(%ebp),%ecx /* copy remainder by 32-bit words */ > shrl $2,%ecx > subl $3,%esi > subl $3,%edi > @@ -223,6 +226,7 @@ ENTRY(bcopy) > popl %edi > popl %esi > cld > + popl %ebp > ret > END(bcopy) >