From owner-freebsd-stable@FreeBSD.ORG Mon Jun 17 13:25:50 2013
Date: Mon, 17 Jun 2013 15:05:34 +0200
From: Andre Albsmeier
To: Jeremy Chadwick
Subject: Re: FreeBSD-9.1: machine reboots during snapshot creation, LORs found
Message-ID: <20130617130534.GA88058@bali>
References: <20130531122611.GA6607@bali>
 <201305311051.03157.jhb@freebsd.org>
 <20130531172523.GA9188@bali>
 <20130616065441.GA15175@icarus.home.lan>
 <20130616080239.GA73100@bali>
 <20130616084937.GA17277@icarus.home.lan>
 <20130616095538.GA73648@bali>
 <20130616103007.GA19957@icarus.home.lan>
In-Reply-To: <20130616103007.GA19957@icarus.home.lan>
Cc: freebsd-stable@freebsd.org, John Baldwin

On Sun, 16-Jun-2013 at 12:30:07 +0200, Jeremy Chadwick wrote:
> On Sun, Jun 16, 2013 at 11:55:38AM +0200, Andre Albsmeier wrote:
> > On Sun, 16-Jun-2013 at 10:49:37 +0200, Jeremy Chadwick wrote:
> > > On Sun, Jun 16, 2013 at 10:02:39AM +0200, Andre Albsmeier wrote:
> > > > On Sun, 16-Jun-2013 at 08:54:41 +0200, Jeremy Chadwick wrote:
> > > > > On Fri, May 31, 2013 at 07:25:23PM +0200, Andre Albsmeier wrote:
> > > > > > On Fri, 31-May-2013 at 16:51:03 +0200, John Baldwin wrote:
> > > > > > > On Friday, May 31, 2013 8:26:11 am Andre Albsmeier wrote:
> > > > > > > > Each day at 5:15 we are generating snapshots on various machines.
> > > > > > > > This used to work perfectly under 7-STABLE for years but since
> > > > > > > > we started to use 9.1-STABLE the machine reboots in about 10%
> > > > > > > > of all cases.
> > > > > > > >
> > > > > > > > After rebooting we find a new snapshot file which is a bit
> > > > > > > > smaller than the good ones and with different permissions
> > > > > > > > It does not succeed a fsck. In this example it is the one
> > > > > > > > whose name is beginning with s3:
> > > > > > > >
> > > > > > > > -r--r----- 1 root operator snapshot 72802894528 29 May 05:15 s2-2013.05.28-03.15.04
> > > > > > > > -r-------- 1 root operator snapshot 72802893824 29 May 05:15 s3-2013.05.29-03.15.03
> > > > > > > > -r--r----- 1 root operator snapshot 72802894528 28 May 14:22 s4-2013.05.23-06.38.44
> > > > > > > > -r--r----- 1 root operator snapshot 72802894528 28 May 14:22 s5-2013.05.24-03.15.03
> > > > > > > > -r--r----- 1 root operator snapshot 72802894528 28 May 14:22 s6-2013.05.25-03.15.03
> > > > > > > >
> > > > > > > > After enabling DIAGNOSTIC, WITNESS and INVARIANTS in the kernel
> > > > > > > > I see the following LORs (mksnap_ffs starts exactly at 5:15):
> > > > > > > >
> > > > > > > > May 29 05:15:00 palveli kernel: lock order reversal:
> > > > > > > > May 29 05:15:00 palveli kernel: 1st 0xc2371da8 ufs (ufs) @ /src/src-9/sys/kern/vfs_mount.c:1240
> > > > > > > > May 29 05:15:00 palveli kernel: 2nd 0xc2371ec4 devfs (devfs) @ /src/src-9/sys/ufs/ffs/ffs_vfsops.c:1414
> > > > > > > > May 29 05:15:04 palveli kernel: lock order reversal:
> > > > > > > > May 29 05:15:04 palveli kernel: 1st 0xc228471c snaplk (snaplk) @ /src/src-9/sys/ufs/ufs/ufs_vnops.c:976
> > > > > > > > May 29 05:15:04 palveli kernel: 2nd 0xc22f25e4 ufs (ufs) @ /src/src-9/sys/ufs/ffs/ffs_snapshot.c:1626
> > > > > > > >
> > > > > > > > Unfortunatley no corefiles are being generated ;-(.
> > > > > > > >
> > > > > > > > I have checked and even rebuilt the (UFS1) fs in question
> > > > > > > > from scratch. I have also seen this happen on an UFS2 on
> > > > > > > > another machine and on a third one when running "dump -L"
> > > > > > > > on a root fs.
> > > > > > > >
> > > > > > > > Any hints of how to proceed?
> > > > > > > >
> > > > > > > Would it be possible to setup a serial console that is logged on this machine
> > > > > > > to see if it is panic'ing but failing to write out a crashdump?
> > > > > >
> > > > > > I'll try to arrange that. It'll take a bit since this
> > > > > > box is 200 km away...
> > > > > >
> > > > > > Maybe I'll find another one nearby to reproduce it...
> > > > >
> > > > > SPECIFICALLY regarding "lack of crash dumps": I need to see the
> > > > > following:
> > > > >
> > > > > * cat /etc/rc.conf
> > > > > * cat /etc/fstab
> > > > >
> > > > > I may need output from other commands, but shall deal with that when I
> > > > > see output from the above. Thanks.
> > > >
> > > > No problem, see below...
> > > >
> > > > To make a long story short, the machine dumps core perfectly
> > > > (tested that a while ago), but not when dealing with _this_
> > > > issue...
> > > >
> > > > I dump on da1s1b and savecore fetches it from there and puts
> > > > it on /var (sitting on da0), that's faster.
> > > >
> > > > rc.conf (beware, rc.conf.local exists):
> > > > ---------------------------------------
> > > > rcshutdown_timeout=180
> > > > tmpmfs=YES
> > > > tmpsize="$(( `/sbin/sysctl -n hw.usermem` / 3000000 ))m"
> > > > tmpmfs_flags="$tmpmfs_flags -v 1 -n"
> > > >
> > > > background_fsck=NO
> > > >
> > > > nisdomainname=ofw.tld
> > > > pflog_flags=-S
> > > >
> > > > syslogd_flags=-svv
> > > > inetd_enable=YES
> > > > inetd_flags=-l
> > > > named_flags="-S 1000"
> > > > named_chrootdir=""
> > > > rwhod_enable=YES
> > > > sshd_enable=YES
> > > > amd_enable=YES
> > > > amd_flags="-F /etc/amd.conf"
> > > > nfs_client_enable=YES
> > > > nfs_access_cache=2
> > > > mountd_flags=-n
> > > > rpcbind_enable=YES
> > > >
> > > > ntpdate_enable=YES
> > > > ntpdate_hosts=ntp
> > > > ntpd_enable=YES
> > > > ntpd_flags="-p /var/run/ntpd.pid"
> > > >
> > > > nis_client_enable=YES
> > > > nis_client_flags="-s -S ofw.tld,nis-16-1,nis-16-2"
> > > > nis_server_flags=-n
> > > > nis_yppasswdd_flags="-t /var/yp/src/master.passwd -f -v"
> > > >
> > > > defaultrouter=192.168.16.2
> > > >
> > > > keyrate=fast
> > > >
> > > > sendmail_flags="-bd -q5m"
> > > > sendmail_submit_flags="$sendmail_flags -ODaemonPortOptions=Addr=localhost"
> > > > sendmail_msp_queue_flags="-Ac -q30m"
> > > > sendmail_rebuild_aliases=NO
> > > >
> > > > lpd_enable=YES
> > > > lpd_flags=-s
> > > > chkprintcap_enable=YES
> > > > dumpdev=AUTO
> > > > clear_tmp_X=NO
> > > > ldconfig_paths=/usr/local/lib
> > > > ldconfig_paths_aout=""
> > > > entropy_file=/boot/entropy-file
> > > >
> > > >
> > > > rc.conf.local:
> > > > --------------
> > > > hostname=typhon.ofw.tld
> > > > ifconfig_msk0="inet 192.168.24.1/21"
> > > > ifconfig_msk0_alias0="inet 192.168.24.10/32"
> > > >
> > > > named_enable=YES
> > > > nfs_server_enable=YES
> > > >
> > > > nis_client_flags="-s -S ofw.tld,nis-24-1,nis-24-2"
> > > > nis_server_enable=YES
> > > >
> > > > defaultrouter=192.168.24.2
> > > >
> > > > lpd_flags=-l
> > > > dumpdev=/dev/da1s1b
> > > > quota_enable=YES
> > > >
> > > >
> > > > fstab:
> > > > ------
> > > > /dev/da0s1a   /        ufs     noatime,rw                              0 1
> > > > /dev/da0s1b   none     swap    sw                                      0 0
> > > > proc          /proc    procfs  rw                                      0 0
> > > > /dev/da0s1d   /usr     ufs     noatime,rw                              0 2
> > > > /dev/da0s1e   /var     ufs     noatime,nosuid,rw                       0 2
> > > >
> > > > /dev/da10p1   /share2  ufs     suiddir,groupquota,noatime,nosuid,rw    0 2
> > > > /dev/da10p2   /raid2   ufs     userquota,noatime,nosuid,rw             0 2
> > >
> > > Thank you. Can you show me output from the following?
> >
> > Thanks to you for looking into this...
> >
> > >
> > > * camcontrol devlist
> >
> > at scbus0 target 0 lun 0 (da0,pass0)
> > at scbus0 target 1 lun 0 (da1,pass1)
> > at scbus1 target 0 lun 0 (da10,pass2)
> >
> > > * gpart show -p da1
> >
> > =>      63  17849937    da1  MBR  (8.5G)
> >         63  17849937  da1s1  freebsd  [active]  (8.5G)
> >
> > And here is gpart show -p da1s1
> >
> > =>       0  17849937   da1s1  BSD  (8.5G)
> >          0        16          - free -  (8.0k)
> >         16    599984  da1s1a  freebsd-ufs  (293M)
> >     600000   2000000  da1s1d  freebsd-ufs  (976M)
> >    2600000  11000000  da1s1e  freebsd-ufs  (5.3G)
> >   13600000   4249937  da1s1b  freebsd-swap  (2.0G)
> >
> > >
> > > I'm pretty sure I see the problem, but I want to be extra sure.
> >
> > I am curious already!
>
> Okay, theory #1 shot down -- you have a valid da1s1b. I was curious
> because rc.conf had dumpdev=AUTO, rc.conf.local had dumpdev=/dev/da1s1b,
> and /etc/fstab made no mention of /dev/da1s1b (as swap). So I was
> thinking "oh, maybe he meant /dev/da0s1b" -- hence my camcontrol + gpart
> request. :-)
>
> I have 2 more possibilities in mind. Could I get...

I know now why it didn't dump: I use two disks da0 and da1.
da1 is a copy of da0 so people can simply unplug da0 in case
of problems and work with da1 (which becomes da0 then).
Since da1 is normally unused, I automatically spin it down
after booting. For some reason, the drive does not start
when FreeBSD-9 wants to dump on it. If I start it manually,
dumping will work. Or if I use FreeBSD-7 it works as well.
Something must have changed between 7 and 9 here...

Anyway, I will configure my FreeBSD-9 machines to dump on da0
so maybe we'll get a crash dump finally...

	-Andre

>
> * Output from: sysctl -a hw | grep mem:
>
> * Output from: uname -a (you can hide the machine name if you want)
>
> * Output from: strings /boot/kernel/kernel | egrep ^option
>
> Thanks.
>
> --
> | Jeremy Chadwick                                   jdc@koitsu.org |
> | UNIX Systems Administrator                http://jdc.koitsu.org/ |
> | Making life hard for others since 1977.             PGP 4BD6C0CB |
>

--
A fool with a tool is still a fool.