Date: Thu, 4 Jul 2013 09:15:50 +0300 From: Konstantin Belousov <kostikbel@gmail.com> To: Andre Albsmeier <Andre.Albsmeier@siemens.com> Cc: "freebsd-stable@freebsd.org" <freebsd-stable@freebsd.org>, John Baldwin <jhb@freebsd.org> Subject: Re: FreeBSD-9.1: machine reboots during snapshot creation, LORs found Message-ID: <20130704061550.GI91021@kib.kiev.ua> In-Reply-To: <20130704052659.GA23398@bali> References: <20130531122611.GA6607@bali> <201305311051.03157.jhb@freebsd.org> <20130616063942.GA72803@bali> <201306171530.31208.jhb@freebsd.org> <20130704051409.GA22021@bali> <20130704052440.GG91021@kib.kiev.ua> <20130704052659.GA23398@bali>
next in thread | previous in thread | raw e-mail | index | archive | help
--Q1ykKx60OEcPG/sP Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Thu, Jul 04, 2013 at 07:27:00AM +0200, Andre Albsmeier wrote: > On Thu, 04-Jul-2013 at 07:24:40 +0200, Konstantin Belousov wrote: > > On Thu, Jul 04, 2013 at 07:14:09AM +0200, Andre Albsmeier wrote: > > > On Mon, 17-Jun-2013 at 21:30:31 +0200, John Baldwin wrote: > > > > On Sunday, June 16, 2013 2:39:42 am Andre Albsmeier wrote: > > > > > On Fri, 31-May-2013 at 16:51:03 +0200, John Baldwin wrote: > > > > > > On Friday, May 31, 2013 8:26:11 am Andre Albsmeier wrote: > > > > > > > Each day at 5:15 we are generating snapshots on various machi= nes. > > > > > > > This used to work perfectly under 7-STABLE for years but since > > > > > > > we started to use 9.1-STABLE the machine reboots in about 10% > > > > > > > of all cases. > > > > > > >=20 > > > > > > > After rebooting we find a new snapshot file which is a bit > > > > > > > smaller than the good ones and with different permissions > > > > > > > It does not succeed a fsck. In this example it is the one > > > > > > > whose name is beginning with s3: > > > > > > >=20 > > > > > > > -r--r----- 1 root operator snapshot 72802894528 29 May 05= :15 s2-2013.05.28-03.15.04 > > > > > > > -r-------- 1 root operator snapshot 72802893824 29 May 05= :15 s3-2013.05.29-03.15.03 > > > > > > > -r--r----- 1 root operator snapshot 72802894528 28 May 14= :22 s4-2013.05.23-06.38.44 > > > > > > > -r--r----- 1 root operator snapshot 72802894528 28 May 14= :22 s5-2013.05.24-03.15.03 > > > > > > > -r--r----- 1 root operator snapshot 72802894528 28 May 14= :22 s6-2013.05.25-03.15.03 > > > > > > >=20 > > > > > > > After enabling DIAGNOSTIC, WITNESS and INVARIANTS in the kern= el > > > > > > > I see the following LORs (mksnap_ffs starts exactly at 5:15): > > > > > > >=20 > > > > > > > May 29 05:15:00 <kern.crit> palveli kernel: lock order revers= al: > > > > > > > May 29 05:15:00 <kern.crit> palveli kernel: 1st 0xc2371da8 uf= s (ufs) @ /src/src-9/sys/kern/vfs_mount.c:1240 > > > > > > > May 29 05:15:00 <kern.crit> palveli kernel: 2nd 0xc2371ec4 de= vfs (devfs) @ /src/src-9/sys/ufs/ffs/ffs_vfsops.c:1414 > > > > > > > May 29 05:15:04 <kern.crit> palveli kernel: lock order revers= al: > > > > > > > May 29 05:15:04 <kern.crit> palveli kernel: 1st 0xc228471c sn= aplk (snaplk) @ /src/src-9/sys/ufs/ufs/ufs_vnops.c:976 > > > > > > > May 29 05:15:04 <kern.crit> palveli kernel: 2nd 0xc22f25e4 uf= s (ufs) @ /src/src-9/sys/ufs/ffs/ffs_snapshot.c:1626 > > > > > > >=20 > > > > > > > Unfortunatley no corefiles are being generated ;-(. > > > > > > >=20 > > > > > > > I have checked and even rebuilt the (UFS1) fs in question > > > > > > > from scratch. I have also seen this happen on an UFS2 on > > > > > > > another machine and on a third one when running "dump -L" > > > > > > > on a root fs. > > > > > > >=20 > > > > > > > Any hints of how to proceed? > > > > > >=20 > > > > > > Would it be possible to setup a serial console that is logged o= n this machine > > > > > > to see if it is panic'ing but failing to write out a crashdump? > > > > >=20 > > > > > Couldn't attach the serial console yet ;-(. But I had people > > > > > attach a KVMoverIP switch and enabled the various KDB options > > > > > in the kernel. Now we can see a bit more (see below) -- no > > > > > crashdump is being generated though. > > > >=20 > > > > :( Unfortunately these LORs don't really help with discerning the = cause of > > > > the reboot. If you have remote power access (and still wanted to t= est this) > > > > one option would be to change KDB to drop into the debugger on a pa= nic. > > > > Then you could connect over the KVM and take images of the original= panic > > > > along with a stack trace. > > >=20 > > > After a few days of no problems, the box decided to crash > > > during mksnap_ffs today ;-(. But now I have a crashdump, > > > see below. Unfortunatley, I cannot upload the dump somewhere > > > but if you ask me check whatever things I'll be happy to help. > > >=20 > > > kgdb /usr/obj/src/src-9/sys/palveli/kernel.debug vmcore.4 > > > GNU gdb 6.1.1 [FreeBSD] > > > Copyright 2004 Free Software Foundation, Inc. > > > GDB is free software, covered by the GNU General Public License, and = you are > > > welcome to change it and/or distribute copies of it under certain con= ditions. > > > Type "show copying" to see the conditions. > > > There is absolutely no warranty for GDB. Type "show warranty" for de= tails. > > > This GDB was configured as "i386-marcel-freebsd"... > > >=20 > > > Unread portion of the kernel message buffer: > > >=20 > > >=20 > > > Fatal trap 12: page fault while in kernel mode > > > fault virtual address =3D 0xcfb5e000 > > > fault code =3D supervisor write, page not present > > > instruction pointer =3D 0x20:0xc07cb2fe > > > stack pointer =3D 0x28:0xd83545d0 > > > frame pointer =3D 0x28:0xd835490c > > > code segment =3D base 0x0, limit 0xfffff, type 0x1b > > > =3D DPL 0, pres 1, def32 1, gran 1 > > > processor eflags =3D interrupt enabled, resume, IOPL =3D 0 > > > current process =3D 12929 (mksnap_ffs) > > > trap number =3D 12 > > > panic: page fault > > > KDB: stack backtrace: > > > db_trace_self_wrapper(c08207eb,d835441c,c05fdfc9,c081df13,c08a82e0,..= =2E) at db_trace_self_wrapper+0x26/frame 0xd83543ec > > > kdb_backtrace(c081df13,c08a82e0,c0801bfa,d8354428,d8354428,...) at kd= b_backtrace+0x29/frame 0xd83543f8 > > > panic(c0801bfa,c0845a01,c2bafae4,1,1,...) at panic+0xc9/frame 0xd8354= 41c > > > trap_fatal(c0ff6000,cfb5e000,2,0,265abf,...) at trap_fatal+0x353/fram= e 0xd835445c > > > trap_pfault(140da,0,c2baf930,c08b6a40,c282145c,...) at trap_pfault+0x= 2d7/frame 0xd83544a4 > > > trap(d8354590) at trap+0x41a/frame 0xd8354584 > > > calltrap() at calltrap+0x6/frame 0xd8354584 > > > --- trap 0xc, eip =3D 0xc07cb2fe, esp =3D 0xd83545d0, ebp =3D 0xd8354= 90c --- > > > bcopy(c2b36548,c2f194e0,0,0,0,...) at bcopy+0x1a/frame 0xd835490c > > > ffs_mount(c2b36548,c2db9000,ff,d8354c08,c2b665e4,...) at ffs_mount+0x= 15ee/frame 0xd8354a3c > >=20 > > From the crash dump in kgdb, do > > list *ffs_mount+0x15ee >=20 > (kgdb) list *ffs_mount+0x15ee > 0xc0748e8e is in ffs_mount (/src/src-9/sys/ufs/ffs/ffs_vfsops.c:483). > 478 > 479 /* > 480 * If this is a snapshot request, take the snapsh= ot. > 481 */ > 482 if (mp->mnt_flag & MNT_SNAPSHOT) > 483 return (ffs_snapshot(mp, fspec)); > 484 } > 485 > 486 /* > 487 * Not an update, or updating the name: look up the name It is not useful, bcopy does not create a frame, so the real caller of the failing bcopy gets lost. It could be uncovered with some stack digging, but I believe it would be easier just fix bcopy. Please apply this patch and reproduce the panic again, then the kgdb backtrace should be more useful. diff --git a/sys/i386/i386/support.s b/sys/i386/i386/support.s index 967a09e..779fa38 100644 --- a/sys/i386/i386/support.s +++ b/sys/i386/i386/support.s @@ -181,11 +181,13 @@ END(bcopyb) * ws@tools.de (Wolfgang Solfrank, TooLs GmbH) +49-228-985800 */ ENTRY(bcopy) + pushl %ebp + movl %esp,%ebp pushl %esi pushl %edi - movl 12(%esp),%esi - movl 16(%esp),%edi - movl 20(%esp),%ecx + movl 8(%ebp),%esi + movl 12(%ebp),%edi + movl 16(%ebp),%ecx =20 movl %edi,%eax subl %esi,%eax @@ -196,12 +198,13 @@ ENTRY(bcopy) cld /* nope, copy forwards */ rep movsl - movl 20(%esp),%ecx + movl 16(%ebp),%ecx andl $3,%ecx /* any bytes left? */ rep movsb popl %edi popl %esi + popl %ebp ret =20 ALIGN_TEXT @@ -214,7 +217,7 @@ ENTRY(bcopy) std rep movsb - movl 20(%esp),%ecx /* copy remainder by 32-bit words */ + movl 16(%ebp),%ecx /* copy remainder by 32-bit words */ shrl $2,%ecx subl $3,%esi subl $3,%edi @@ -223,6 +226,7 @@ ENTRY(bcopy) popl %edi popl %esi cld + popl %ebp ret END(bcopy) =20 --Q1ykKx60OEcPG/sP Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.20 (FreeBSD) iQIcBAEBAgAGBQJR1RMVAAoJEJDCuSvBvK1B8AoQAJFjo/vCKR561LcJcSqI9UFf nwsWeOPOa6uXNoTQrV4S+vC2PDiT4MRLbSOIrT+gEVpnSg8W62Z1lSQBb+/VsrSS OYAhbPjsrtmMSS1Lk2A/3Crq9E05Oiq9hl9EVgvtlMKZRabckHAc/3pKmrzsnxUg Ft56OCEQiK0pI1czbglb3PGMP0vrHXi1OimQ+P5LKZCnmykuHEdKw3vdfZ2Fz9BO Vwbs8P4CJRMV5OKu/hlA5teZ1JrI/7XihGSYxlQ6csoa8I5NtWHf38h8Di7FbtwI vh8NEVnHzNc6W+mBFJjMl8Wi0oFsH1iJQxVas0tHQuguIA8EKi/6CxAhViFL44Qe 04jNXZJDf5uJLeUmor2Vce7ytNp7uY2ZhtgEgCTHZeyuayJ09tzXIj+GYeRx1O9Z KcEfxhPDSQI3IFAU+Ep9ibqgBVjLBiLhjesPbrg58W2/OIakvoSJcAWNUwfYHh9u 3sdhSfnrjfGULXvKK9VuWQX1dqGyPf7VVGZ+4lBjBwWbp3ziGAHqbpmGzLqXU54/ 4pMHIT2vdK1/40pbqHW2S8k3lg8qmB4PnqGN5UtrMEwqu82hXflpEuVEbr2YWgXp H+SjJUXm6ZZkhz6dRcSXrsppeXpsnxzipyX5tOZk+EuacVPiQ2xqlGSD5wmXU6Jo lOUP7CCWcrjgWSnr4Pgp =O1v/ -----END PGP SIGNATURE----- --Q1ykKx60OEcPG/sP--
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20130704061550.GI91021>