From owner-freebsd-stable@FreeBSD.ORG Fri Jan 27 03:38:52 2012 Return-Path: Delivered-To: freebsd-stable@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 446D51065670 for ; Fri, 27 Jan 2012 03:38:52 +0000 (UTC) (envelope-from gpalmer@freebsd.org) Received: from noop.in-addr.com (mail.in-addr.com [IPv6:2001:470:8:162::1]) by mx1.freebsd.org (Postfix) with ESMTP id 03E0E8FC12 for ; Fri, 27 Jan 2012 03:38:52 +0000 (UTC) Received: from gjp by noop.in-addr.com with local (Exim 4.77 (FreeBSD)) (envelope-from ) id 1Rqce6-0007Q7-Jo; Thu, 26 Jan 2012 22:38:38 -0500 Date: Thu, 26 Jan 2012 22:38:38 -0500 From: Gary Palmer To: Jeremy Chadwick Message-ID: <20120127033838.GF17973@in-addr.com> References: <20120127024815.GD17973@in-addr.com> <20120127030906.GA67449@icarus.home.lan> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20120127030906.GA67449@icarus.home.lan> X-SA-Exim-Connect-IP: X-SA-Exim-Mail-From: gpalmer@freebsd.org X-SA-Exim-Scanned: No (on noop.in-addr.com); SAEximRunCond expanded to false Cc: freebsd-stable@freebsd.org Subject: Re: Panic on 7.4-RELEASE-p5 X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 27 Jan 2012 03:38:52 -0000 On Thu, Jan 26, 2012 at 07:09:06PM -0800, Jeremy Chadwick wrote: > On Thu, Jan 26, 2012 at 09:48:15PM -0500, Gary Palmer wrote: > > Hi, > > > > My Soekris firewall just panic'd > > > > Fatal trap 12: page fault while in kernel mode > > fault virtual address = 0x14001d > > fault code = supervisor write, page not present > > instruction pointer = 0x20:0xc06fd2d3 > > stack pointer = 0x28:0xd63557bc > > frame pointer = 0x28:0xd63558a4 > > code segment = base 0x0, limit 0xfffff, type 0x1b > > = DPL 0, pres 1, def32 1, gran 1 > > processor eflags = interrupt enabled, resume, IOPL = 0 > > current process = 1869 (tcpdump) > > trap number = 12 > > panic: page fault > > KDB: stack backtrace: > > db_trace_self_wrapper(c079bbfc,c07fb800,c078d680,d6355660,d6355660,...) at db_tr > > ace_self_wrapper+0x26 > > panic(c078d680,c07b829c,c36d6d24,1,1,...) at panic+0xed > > trap_fatal(c2c352d0,140000,2,8,d63556c8,...) at trap_fatal+0x234 > > trap_pfault(c2b13620,0,c368f460,4,c36d6b00,...) at trap_pfault+0x27a > > trap(d635577c) at trap+0x34e > > calltrap() at calltrap+0x6 > > --- trap 0xc, eip = 0xc06fd2d3, esp = 0xd63557bc, ebp = 0xd63558a4 --- > > softdep_disk_io_initiation(c2ade474,1000,c3003738,1,c2aab9b8,...) at softdep_dis > > k_io_initiation+0xb3 > > ffs_geom_strategy(c3003738,c2aab9b8,1000,edb000,0,...) at ffs_geom_strategy+0x10 > > c > > ufs_strategy(d6355910,d6355910,c07f3980,c378bbdc,c2aab9b8,...) at ufs_strategy+0 > > x64 > > bufstrategy(c378bc9c,c2aab9b8,c368f460,c2aab9b8,c2bd46b4,...) at bufstrategy+0x2 > > e > > bufwrite(c2aab9b8,c2aabb04,20,c368f460,0,...) at bufwrite+0xf4 > > cluster_wbuild(c378bbdc,4000,3b7,0,8,...) at cluster_wbuild+0x6c9 > > cluster_write(c378bbdc,c2bd46b4,edc000,0,7f,...) at cluster_write+0x715 > > ffs_write(d6355bc0,c07091e4,c378bc34,0,c378bc64,...) at ffs_write+0x837 > > VOP_WRITE_APV(c07e9500,d6355bc0,c368f460,c07a2aec,252,...) at VOP_WRITE_APV+0xa0 > > vn_write(c2e2adf4,d6355c54,c36d7500,0,c368f460,...) at vn_write+0x26f > > dofilewrite(d6355c54,ffffffff,ffffffff,0,c2e2adf4,...) at dofilewrite+0x84 > > kern_writev(c368f460,4,d6355c54,d6355c74,1,...) at kern_writev+0x58 > > write(c368f460,d6355cf8,c,c0798e1b,39bfbd8f,...) at write+0x50 > > syscall(d6355d38) at syscall+0x1b9 > > Xint0x80_syscall() at Xint0x80_syscall+0x20 > > --- syscall (4, FreeBSD ELF32, write), eip = 0x28354663, esp = 0xbfbfe85c, ebp = > > 0xbfbfe878 --- > > Uptime: 11d5h5m49s > > Physical memory: 503 MB > > Dumping 112 MB: 97 81 65 49 33 17 1 > > Dump complete > > Automatic reboot in 1 seconds - press a key on the console to abort > > > > fsck prevented the box from coming up automatically > > > > /dev/ufs/varlog: PARTIALLY TRUNCATED INODE I=1496123 > > /dev/ufs/varlog: UNEXPECTED SOFT UPDATE INCONSISTENCY; RUN fsck MANUALLY. > > THE FOLLOWING FILE SYSTEM HAD AN UNEXPECTED INCONSISTENCY: > > ufs: /dev/ufs/varlog (/var/log) > > > > Anyone seen this before? Any clues? > > Please shed some light as to your filesystem and disk setup. > Partitioning details, any GEOM layers you use, you get the idea. > The more verbose the better. Please be sure to state which filesystems > use softupdates and which do not (this matters a lot in this situation). kryten# mount /dev/ad0s1a on / (ufs, local, read-only) devfs on /dev (devfs, local) /dev/md0 on /etc (ufs, local) /dev/md1 on /etc (ufs, local) /dev/ufs/home on /home (ufs, local, soft-updates) /dev/ufs/var on /var (ufs, local, soft-updates) /dev/ufs/varlog on /var/log (ufs, local, soft-updates) devfs on /var/db/dhcpd/dev (devfs, local) (note the second md on /etc is an anomaly - there is normally only one and the daily security report already flagged that up as being different. Haven't looked into where / why that happened yet. This is a nanobsd build) kryten# glabel list Geom name: ad0s2a Providers: 1. Name: ufsid/4ef571bfcc16bfa6 Mediasize: 1015636480 (969M) Sectorsize: 512 Mode: r0w0e0 secoffset: 0 offset: 0 seclength: 1983665 length: 1015636480 index: 0 Consumers: 1. Name: ad0s2a Mediasize: 1015636480 (969M) Sectorsize: 512 Mode: r0w0e0 Geom name: ad1s1d Providers: 1. Name: ufs/varlog Mediasize: 107374182912 (100G) Sectorsize: 512 Mode: r1w1e1 secoffset: 0 offset: 0 seclength: 209715201 length: 107374182912 index: 0 Consumers: 1. Name: ad1s1d Mediasize: 107374182912 (100G) Sectorsize: 512 Mode: r1w1e2 Geom name: ad1s1e Providers: 1. Name: ufs/home Mediasize: 8589934592 (8.0G) Sectorsize: 512 Mode: r1w1e1 secoffset: 0 offset: 0 seclength: 16777216 length: 8589934592 index: 0 Consumers: 1. Name: ad1s1e Mediasize: 8589934592 (8.0G) Sectorsize: 512 Mode: r1w1e2 Geom name: ad1s1f Providers: 1. Name: ufs/var Mediasize: 3438215680 (3.2G) Sectorsize: 512 Mode: r1w1e1 secoffset: 0 offset: 0 seclength: 6715265 length: 3438215680 index: 0 Consumers: 1. Name: ad1s1f Mediasize: 3438215680 (3.2G) Sectorsize: 512 Mode: r1w1e2 Geom name: ad0s1a Providers: 1. Name: ufsid/4f125433ead0d4df Mediasize: 1015636480 (969M) Sectorsize: 512 Mode: r0w0e0 secoffset: 0 offset: 0 seclength: 1983665 length: 1015636480 index: 0 Consumers: 1. Name: ad0s1a Mediasize: 1015636480 (969M) Sectorsize: 512 Mode: r0w0e0 Geom name: ad0s3 Providers: 1. Name: ufsid/47d97f881feaef9a Mediasize: 17031168 (16M) Sectorsize: 512 Mode: r0w0e0 secoffset: 0 offset: 0 seclength: 33264 length: 17031168 index: 0 Consumers: 1. Name: ad0s3 Mediasize: 17031168 (16M) Sectorsize: 512 Mode: r0w0e0 # grep 'ad[01]' /var/run/dmesg.boot ad0: 1953MB at ata0-master WDMA2 ad1: DMA limited to UDMA33, device found non-ATA66 cable ad1: 114473MB at ata0-slave UDMA33 Trying to mount root from ufs:/dev/ad0s1a > As for the actual crash -- I assume the filesystem in question which > induced this has softupdates. I have no idea from the backtrace which filesystem was being written to at the time. The filesystem that it complained was unable to be checked on boot does have softupdates on it (although thinking about it I probably don't actually need SU) > Have you booted this box into single-user and issued a manual fsck on > all the filesystems? If not, please do so ASAP and provide details of > the results. I don't use background fsck and when the box tried rebooting it couldn't handle an unexpected inconsistency in one filesystem and dropped to single user mode. I did 2 fsck passes from single user mode before exiting out of the shell and letting the box finish booting. ( from a login shell after the boot has finished ) # grep fsck /etc/rc.conf background_fsck="NO" ( from my console log ) Starting file system checks: /dev/ad0s1a: FILE SYSTEM CLEAN; SKIPPING CHECKS /dev/ad0s1a: clean, 985492 free (33988 frags, 118938 blocks, 1.7% fragmentation) /dev/ad0s3: FILE SYSTEM CLEAN; SKIPPING CHECKS /dev/ad0s3: clean, 30883 free (139 frags, 3843 blocks, 0.4% fragmentation) /dev/ufs/home: 142 files, 1974 used, 4158281 free (81 frags, 519775 blocks, 0.0% fragmentation) /dev/ufs/var: UNREF FILE I=75348 OWNER=someuser MODE=100600 /dev/ufs/var: SIZE=2048 MTIME=Jan 25 14:47 2012 (CLEARED) /dev/ufs/var: UNREF FILE I=75349 OWNER=someuser MODE=100644 /dev/ufs/var: SIZE=39402 MTIME=Jan 25 14:43 2012 (CLEARED) /dev/ufs/var: UNREF FILE I=75350 OWNER=someuser MODE=100600 /dev/ufs/var: SIZE=2048 MTIME=Jan 25 14:47 2012 (CLEARED) /dev/ufs/var: FREE BLK COUNT(S) WRONG IN SUPERBLK (SALVAGED) /dev/ufs/var: SUMMARY INFORMATION BAD (SALVAGED) /dev/ufs/var: BLK(S) MISSING IN BIT MAPS (SALVAGED) /dev/ufs/var: 499 files, 2861 used, 1661962 free (234 frags, 207716 blocks, 0.0% fragmentation) /dev/ufs/varlog: PARTIALLY TRUNCATED INODE I=1496123 /dev/ufs/varlog: UNEXPECTED SOFT UPDATE INCONSISTENCY; RUN fsck MANUALLY. THE FOLLOWING FILE SYSTEM HAD AN UNEXPECTED INCONSISTENCY: ufs: /dev/ufs/varlog (/var/log) Automatic file system check failed; help! ERROR: ABORTING BOOT (sending SIGTERM to parent)! # fsck -p /dev/ad0s1a: FILE SYSTEM CLEAN; SKIPPING CHECKS /dev/ad0s1a: clean, 985492 free (33988 frags, 118938 blocks, 1.7% fragmentation) /dev/ad0s3: FILE SYSTEM CLEAN; SKIPPING CHECKS /dev/ad0s3: clean, 30883 free (139 frags, 3843 blocks, 0.4% fragmentation) /dev/ufs/home: FILE SYSTEM CLEAN; SKIPPING CHECKS /dev/ufs/home: clean, 4158281 free (81 frags, 519775 blocks, 0.0% fragmentation) /dev/ufs/var: FILE SYSTEM CLEAN; SKIPPING CHECKS /dev/ufs/var: clean, 1661962 free (234 frags, 207716 blocks, 0.0% fragmentation) /dev/ufs/varlog: PARTIALLY TRUNCATED INODE I=1496123 /dev/ufs/varlog: UNEXPECTED SOFT UPDATE INCONSISTENCY; RUN fsck MANUALLY. THE FOLLOWING FILE SYSTEM HAD AN UNEXPECTED INCONSISTENCY: ufs: /dev/ufs/varlog (/var/log) # fsck -y /dev/ufs/varlog ** /dev/ufs/varlog ** Last Mounted on /var/log ** Phase 1 - Check Blocks and Sizes PARTIALLY TRUNCATED INODE I=1496123 SALVAGE? yes ** Phase 2 - Check Pathnames ** Phase 3 - Check Connectivity ** Phase 4 - Check Reference Counts ** Phase 5 - Check Cyl groups FREE BLK COUNT(S) WRONG IN SUPERBLK SALVAGE? yes SUMMARY INFORMATION BAD SALVAGE? yes BLK(S) MISSING IN BIT MAPS SALVAGE? yes 907 files, 15527664 used, 36687228 free (860 frags, 4585796 blocks, 0.0% fragmen tation) ***** FILE SYSTEM MARKED CLEAN ***** ***** FILE SYSTEM WAS MODIFIED ***** # fsck -y /dev/ufs/varlog ** /dev/ufs/varlog ** Last Mounted on /var/log ** Phase 1 - Check Blocks and Sizes ** Phase 2 - Check Pathnames ** Phase 3 - Check Connectivity ** Phase 4 - Check Reference Counts ** Phase 5 - Check Cyl groups 907 files, 15527664 used, 36687228 free (860 frags, 4585796 blocks, 0.0% fragmen tation) # fsck -p /dev/ad0s1a: FILE SYSTEM CLEAN; SKIPPING CHECKS /dev/ad0s1a: clean, 985492 free (33988 frags, 118938 blocks, 1.7% fragmentation) /dev/ad0s3: FILE SYSTEM CLEAN; SKIPPING CHECKS /dev/ad0s3: clean, 30883 free (139 frags, 3843 blocks, 0.4% fragmentation) /dev/ufs/var: FILE SYSTEM CLEAN; SKIPPING CHECKS /dev/ufs/home: FILE SYSTEM CLEAN; SKIPPING CHECKS /dev/ufs/home: clean, 4158281 free (81 frags, 519775 blocks, 0.0% fragmentation) /dev/ufs/varlog: FILE SYSTEM CLEAN; SKIPPING CHECKS /dev/ufs/varlog: clean, 36687228 free (860 frags, 4585796 blocks, 0.0% fragmenta tion) /dev/ufs/var: clean, 1661962 free (234 frags, 207716 blocks, 0.0% fragmentation) > These are confirmed cases of background_fsck not properly > catching/dealing with all filesystem errors, thus letting some > transparently pass through. You are running 7.4-p2, which makes this > likelihood even more probable. This is why I advocate (and have for > years) background_fsck="no" in rc.conf. I can dig up the (very long and > semi-heated) thread discussing this if you want, but I probably won't > get to it until next week. I have too much going on with job-related > things outside of FreeBSD to spend a lot of time with it now. Way ahead of you Jeremy :) None of my boxes use background FSCK for that very reason. I prefer to wait a bit longer for the boot than suffer ... oddities. Thanks, Gary