Date: Thu, 22 Feb 2018 15:18:25 +0100
From: "O. Hartmann" <ohartmann@walstatt.org>
To: Gary Jennejohn <gljennjohn@gmail.com>
Cc: "O. Hartmann" <ohartmann@walstatt.org>, "Chris H" <bsd-lists@BSDforge.com>, "FreeBSD Current" <freebsd-current@freebsd.org>, Warner Losh <imp@bsdimp.com>, Ed Maste <emaste@freebsd.org>, Michael Tuexen <tuexen@freebsd.org>, Mark Johnston <markj@freebsd.org>
Subject: Re: kernel: failed: cg 5, cgp: 0xd11ecd0d != bp: 0x63d3ff1d
Message-ID: <20180222151825.2b193c4a@freyja.zeit4.iv.bundesimmobilien.de>
In-Reply-To: <20180222092620.7c327329@ernst.home>
References: <f7ffa21203887e43e2acd399cf93871d@udns.ultimatedns.net> <20180220123953.5e987691@ernst.home> <20180222083707.73ae3036@freyja.zeit4.iv.bundesimmobilien.de> <20180222092620.7c327329@ernst.home>
On Thu, 22 Feb 2018 09:26:20 +0100
Gary Jennejohn <gljennjohn@gmail.com> wrote:

> On Thu, 22 Feb 2018 08:37:07 +0100
> "O. Hartmann" <ohartmann@walstatt.org> wrote:
>
> > On Tue, 20 Feb 2018 12:39:53 +0100
> > Gary Jennejohn <gljennjohn@gmail.com> wrote:
> >
> > > On Mon, 19 Feb 2018 14:18:15 -0800
> > > "Chris H" <bsd-lists@BSDforge.com> wrote:
> > >
> > > > I'm seeing a number of messages like the following:
> > > > kernel: failed: cg 5, cgp: 0xd11ecd0d != bp: 0x63d3ff1d
> > > >
> > > > and was wondering if it's anything to be concerned with, or whether
> > > > fsck(8) is fixing them.
> > > > This began to happen when the power went out on a new install:
> > > > FreeBSD dns0 12.0-CURRENT FreeBSD 12.0-CURRENT #0: Wed Dec 13 06:07:59
> > > > PST 2017 root@dns0:/usr/obj/usr/src/amd64.amd64/sys/DNS0 amd64
> > > > which hadn't yet been hooked up to the UPS.
> > > > I performed an fsck in single user mode upon power-up. Which ended with
> > > > the mount points being masked CLEAN. I was asked if I wanted to use the
> > > > JOURNAL. I answered Y.
> > > > FWIW the systems are UFS2 (ffs) have gpart labels, and were newfs'd
> > > > thusly: newfs -U -j
> > > >
> > > > Thank you for all your time, and consideration.
> > > >
> > >
> > > fsck fixes these errors only when the user does NOT use the journal.
> > > You should re-do the fsck.
> > >
> >
> > When first these mysterious errors occured on several boxes running CURRENT,
> > that was in December 2017 if I'm right, I also whitnessed mysterious and
> > frequent crashes on several SSD driven machines, where this error described
> > above occured.
> >
> > While the error vanished somehow in the meanwhile while CURRENT proceeds,
> > the crashes continued - on two boxes, I dumped restore the OS on the
> > system's SSD by reformatting the SSD from sratch (UFS2, soft update+
> > journaling). On those boxes the mysterious crashes vanished since then!
> >
> > On box left so far, my workstation.
> > And this box continous to crash now and
> > started crashing today again while compiling world/kernel.
> >
> > The fun-part is: even after a clean shutdown, where I can not detect any
> > filesystem inconsistencies and rebooting and, again: no reported
> > inconsistencies on the console/messages/logs, the box crashes spontanously.
> > Now (today) I could trigger the reboot by starting "make -j4 buildworld
> > buildkernel" and after showing the initial compiler statements/build
> > framework statements, the box went to Nirwana. A well known phenomenon
> > right now.
> >
> > I checked now the consistency of the filesystem, here is the result of
> > the /usr/obj tree, which is a dedicated GPT partition
> > (label: /dev/gpt/usr.obj):
> >
> >
> > [...]
> > root@box1:~ # fsck -fy /dev/gpt/usr.obj
> > ** /dev/gpt/usr.obj
> > ** Last Mounted on /usr/obj
> > ** Phase 1 - Check Blocks and Sizes
> > ** Phase 2 - Check Pathnames
> > UNALLOCATED I=515 OWNER=root MODE=0
> > SIZE=0 MTIME=Feb 22 07:25 2018
> > NAME=/usr/src/amd64.amd64/sys/BOX1/config.c.new
> >
> > UNEXPECTED SOFT UPDATE INCONSISTENCY
> >
> > REMOVE? yes
> >
> > DIRECTORY CORRUPTED I=169691 OWNER=root MODE=40775
> > SIZE=1536 MTIME=Feb 22 05:16 2018
> > DIR=/usr/src/amd64.amd64/sys/BOX1/modules/usr/src/sys/modules/nfsd
> >
> > UNEXPECTED SOFT UPDATE INCONSISTENCY
> >
> > SALVAGE? yes
> >
> > ** Phase 3 - Check Connectivity
> > ** Phase 4 - Check Reference Counts
> > ** Phase 5 - Check Cyl groups
> > FREE BLK COUNT(S) WRONG IN SUPERBLK
> > SALVAGE? yes
> >
> > SUMMARY INFORMATION BAD
> > SALVAGE? yes
> >
> > BLK(S) MISSING IN BIT MAPS
> > SALVAGE? yes
> >
> > 126922 files, 848197 used, 1178482 free (89210 frags, 136159 blocks, 4.4%
> > fragmentation)
> >
> > ***** FILE SYSTEM MARKED DIRTY *****
> >
> > ***** FILE SYSTEM WAS MODIFIED *****
> >
> > ***** PLEASE RERUN FSCK *****
> >
> > [...]
> >
> > When doing a installworld, I pre-emptively perform in single user mode
> > before mounting the partitions a "fsck -yf" two times. In most cases, the
> > filesystem are reported clean, but sometimes especially those under high
> > I/O (/usr/src and mostly /usr/obj on this build machine) there are reports
> > of corruption.
> >
> > As I reported, the very same behaviour occured on three boxes simultanously
> > and I got rid of it by completely reformatting the SSDs (never had issues
> > so far with HDD based boxes!).
> >
> > I hope I can refurbish this weekend the remaining box and I could report, if
> > desired, whether this box returns to a healthy state as the others or if my
> > observation was a simple coincidence of issues ...
> >
> > Thanks for the patience,
> >
>
> I also see such problems only with SSDs. Probably because the SSDs
> are buffering writes internally which never make it into the flash
> chips, although the SSDs report that the writes were completed.
>
> HDDs apparently don't do that, or have a smaller cache.
>
> I then also run fsck in single-user mode, but I explicitly do NOT
> use the journal, i.e., I do NOT run fsck -y. But I guess that using
> fsck -fy is equivalent to not using the journal.
>
> In my case the SSDs are error free after doing the fsck without
> using the jounal until the next crash happens. My box with a
> Ryzen 5 1600 tends to hang for no apparent reason, so I see these
> errors fairly frequently because I have to reset the box :(
>

In my case here, I do not have to wait for a crash with an inconsistent
filesystem to have some weird behaviour with the journaling. Somehow, in my
naive terms, there is some strange problem hidden on partitions.
Since December last year I have had very weird and bad filesystem
corruption when performing "make installworld": the boot process stopped at
BTX or claimed to have no loader, although the installation had got as far
as installing everything in /boot/, and other directories such as /sbin or
/libexec contained nullified files. This corruption happened even when I had
fsck'ed the SSD in single-user mode prior to "make installworld". The result
was an installation from a USB flash drive and then, again, rebuilding
world, kernel, and so on.

Those horrible failures went away on all SSD-based systems after I
reformatted /usr/src, /usr/obj and /tmp (all dedicated partitions in my
case), where the inconsistencies occurred most. On the systems where I also
reformatted /, all of these problems went away! The remaining box, where I
haven't reformatted / so far, is the box in question here. Now that /usr/obj
and /usr/src are newly formatted, the horrible corruption during
installworld has disappeared, but the crashes go on. Heavy I/O with lots of
storage operations in particular triggers spontaneous crashes.

To me, it looks like there is something really fishy with UFS2. Since I
perform buildworlds of CURRENT almost daily on three boxes, I guess
something happened to the filesystem when CURRENT had hiccups, and the
"inconsistency" persisted until a complete newfs of the whole SSD.

I'm sorry that I can't provide more qualified data ...

Regards,

Oliver
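
For anyone who wants to repeat the "run fsck more than once" routine
mentioned in this thread, here is a minimal sh sketch. It is only an
illustration, not a tested recovery procedure. The names FSCK_CMD,
fsck_needs_rerun and fsck_until_clean and the default limit of three passes
are assumptions made up for this example; the "fsck -fy" invocation and the
two marker strings are taken from the output quoted in the thread.

```sh
#!/bin/sh
# Sketch only: repeat a forced fsck until its output no longer asks for
# another pass. FSCK_CMD, fsck_needs_rerun and fsck_until_clean are
# hypothetical names chosen for this example.

# Command to run; "fsck -fy" is the invocation quoted in the thread.
FSCK_CMD="${FSCK_CMD:-fsck -fy}"

# Exit status 0 if the fsck output on stdin asks for another run.
fsck_needs_rerun() {
    grep -q -e 'PLEASE RERUN FSCK' -e 'FILE SYSTEM MARKED DIRTY'
}

# Usage: fsck_until_clean <device> [max_passes]
# Returns 0 once a pass no longer requests a rerun, 1 if max_passes
# (assumed default: 3) is reached first.
fsck_until_clean() {
    dev="$1"
    max="${2:-3}"
    pass=1
    while [ "$pass" -le "$max" ]; do
        # Capture and echo the full output so nothing is hidden.
        out=$($FSCK_CMD "$dev" 2>&1)
        printf '%s\n' "$out"
        if ! printf '%s\n' "$out" | fsck_needs_rerun; then
            return 0
        fi
        pass=$((pass + 1))
    done
    return 1
}
```

After sourcing the functions, a call would look like
"fsck_until_clean /dev/gpt/usr.obj", run in single-user mode with the
partition unmounted. The script only looks at the text fsck prints; it does
not interpret fsck's exit status.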