Date: Sun, 25 Nov 2007 18:21:36 -0800 From: Jeremy Chadwick <koitsu@FreeBSD.org> To: freebsd-stable@freebsd.org Subject: Re: RELENG_6 kernel panic + savecore(8) problem Message-ID: <20071126022136.GA1564@eos.sc1.parodius.com> In-Reply-To: <20071107232328.GA1678@eos.sc1.parodius.com> References: <20071107191611.GA1400@eos.sc1.parodius.com> <20071107232328.GA1678@eos.sc1.parodius.com>
next in thread | previous in thread | raw e-mail | index | archive | help
On Wed, Nov 07, 2007 at 03:23:28PM -0800, Jeremy Chadwick wrote: > On Wed, Nov 07, 2007 at 11:16:11AM -0800, Jeremy Chadwick wrote: > > Tracing pid 3 tid 100001 td 0xc7c6ad80 > > kdb_enter(3228441820,3228796672,3228487817,3867634632,256,...) at kdb_enter+48 > > panic(3228487817,3426817152,256,3228643296,0,...) at panic+206 > > handle_written_inodeblock(3459887104,3688934424,3226775710,3228787204,3228175693,...) at handle_written_inodeblock+1503 > > softdep_disk_write_complete(3688934424,3227842097,3356275348,3867634836,3226342800,...) at softdep_disk_write_complete+241 > > bufdone(3688934424,0,3867634856,3226352850,3356275348,...) at bufdone+126 > > g_vfs_done(3356275348,0,0,3352445440,3355957180) at g_vfs_done+198 > > biodone(3356275348,3228786984,588,3228423470,100,...) at biodone+178 > > g_io_schedule_up(3351686528,76,3351679512,3226344072,3867634980,...) at g_io_schedule_up+137 > > g_up_procbody(0,3867635000,0,0,0,...) at g_up_procbody+122 > > fork_exit(3226344072,0,3867635000) at fork_exit+122 > > fork_trampoline() at fork_trampoline+8 > > A follow-up to this: > > It appears that somehow a few of the filesystems on the disk (it's a > single-disk system) were suffering from some bizarre form of soft update > corruption. And tonight we had the same problem happen again. The backtrace when the machine panic'd is identical: Tracing pid 3 tid 100001 td 0xc7c6ed80 kdb_enter(c06e475e,c073ade0,c06efb55,e6876bc8,100,...) at kdb_enter+0x30 panic(c06efb55,ce04b280,100,c07156c0,0,...) at panic+0xce handle_written_inodeblock(c858d200,dbda0a70,c07388e4,c06a3e4a,e6876c30,...) at handle_written_inodeblock+0x5df softdep_disk_write_complete(dbda0a70,c0652591,c80e65ac,e6876c94,c04e16c4,...) at softdep_disk_write_complete+0xf1 bufdone(dbda0a70,0,e6876ca8,c04e3e06,c80e65ac,...) at bufdone+0x7e g_vfs_done(c80e65ac,0,0,c7d28200,c80a418c) at g_vfs_done+0xc6 biodone(c80e65ac,c0738808,24c,c06dff1c,64,...) at biodone+0xb2 g_io_schedule_up(c7c6ed80,4c,c7c6d218,c04e1bbc,e6876d24,...) at g_io_schedule_up+0x89 g_up_procbody(0,e6876d38,0,0,0,...) at g_up_procbody+0x7a fork_exit(c04e1bbc,0,e6876d38) at fork_exit+0x7a fork_trampoline() at fork_trampoline+0x8 Background fsck reported similar weirdness until I brought down the system into single-user, fsck'd things, and brought the system back up again. Also like last time, savecore found no cores on /dev/ad0s1b, despite one being done after doing "panic" in DDB. I believe I may have figured out why that's occurring, though: Using `rcorder /etc/rc.d/*` I'm able to see that the following rc scripts are run in this respective order: /etc/rc.d/dumpon <-- Runs `dumpon -v /dev/ad0s1b` /etc/rc.d/swap1 <-- Runs `swapon -a` /etc/rc.d/addswap <-- Adds additional swap spaces /etc/rc.d/savecore <-- Runs `savecore /var/crash /dev/ad0s1b` I believe the problem is that /etc/rc.d/swap1 is being run before savecore. I'm guessing that swapon(8) actually destroys/clobbers the existing saved kernel panic/core data, thus one will never get a coredump in /var/crash. I believe some re-organisation of rcorder(8) arguments in the files is in order, but I don't know what should be changed to what. I'll submit a PR for the above, because IMHO that's a serious one. -- | Jeremy Chadwick jdc at parodius.com | | Parodius Networking http://www.parodius.com/ | | UNIX Systems Administrator Mountain View, CA, USA | | Making life hard for others since 1977. PGP: 4BD6C0CB |
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20071126022136.GA1564>