Date: Sun, 16 Oct 2005 17:01:28 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
To: Don Lewis <truckman@FreeBSD.org>
Cc: freebsd-current@FreeBSD.org, obrien@FreeBSD.org
Subject: Re: [PANIC] ufs_dirbad: bad dir
Message-ID: <200510170001.j9H01S5h037788@apollo.backplane.com>
References: <200510162257.j9GMvtIo060326@gw.catspoiler.org>
:
:On 16 Oct, Matthew Dillon wrote:
:>     Ach. sigh.  Another false alarm.  Sorry.  The code is fine.  It's
:>     because the 'end' block is calculated inclusively, e.g.
:>     end_lbn = start_lbn + len - 1.  I'm still investigating it.
:>
:>     There is a bug if the range reallocblks is called with spans
:>     more than two blockmaps, but I don't think that case can occur in real
:>     life due to limitations in the range passed by the caller.  Probably
:>     worth a KASSERT, though.
:
:Is there any correlation between this problem and the file system block
:size?  I've *never* encountered this problem, but I've only used block
:sizes up to 16K, and mostly just 4K and 8K.  I seem to have a dim memory
:of a mention of problems of some sort with large block sizes.

    It's possible but unlikely.  Ours tend to be 1K/8K or 2K/16K.  The
    frag ratio is 1:8 in both cases so it doesn't hit the funny frag
    masking code in the fragment allocator.  The error is not occurring
    in a fragment, either; it has so far only occurred in blocks running
    through an indirect block.

    So far the two crash dumps I've looked at show corruption in the first
    or second block addressed by the first indirect block (lbn 12 and 13),
    which implies that an indirect block is getting trashed.  But the
    indirect block itself looks ok.

    From the crash dumps I have, the indirect block (-12) was in the
    buffer cache and I was able to look at it.  The contents of the block
    looked just fine:

    (kgdb) print $13.b_data
    $14 = 0xc1f00000 "ðAi"
    (kgdb) x/x $14
    0xc1f00000:     0x006941f0
    (kgdb)
    0xc1f00004:     0x006944e0      <<<< this one
    (kgdb)
    0xc1f00008:     0x00000000
    (kgdb)
    0xc1f0000c:     0x00000000
    (kgdb)
    0xc1f00010:     0x00000000
    (kgdb)
    0xc1f00014:     0x00000000
    (kgdb)
    0xc1f00018:     0x00000000
    (kgdb)
    0xc1f0001c:     0x00000000
    ...

    This is consistent with the directory that the panic occurred on.  So
    the indirect block itself does not appear to be garbage.
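    The lbn arithmetic above can be sketched as follows.  This is only an
    illustration, assuming the traditional UFS layout of 12 direct block
    pointers in the inode (which is what makes lbn 12 the first block
    reached through the first indirect block) and 32-bit block pointers as
    in the dump.  The names end_lbn(), indir_slot() and indir_dump are
    hypothetical helpers, not kernel code.

```c
#include <assert.h>
#include <stdint.h>

#define NDADDR 12  /* assumed: direct block pointers held in the inode */

/* Inclusive last block of a run of len blocks starting at start_lbn. */
static int64_t end_lbn(int64_t start_lbn, int len)
{
    assert(len > 0);
    return start_lbn + len - 1;
}

/*
 * Slot within the first (single) indirect block, or -1 for a direct block.
 * Only meaningful for lbns that the single indirect block covers.
 */
static int indir_slot(int64_t lbn)
{
    assert(lbn >= 0);
    return (lbn < NDADDR) ? -1 : (int)(lbn - NDADDR);
}

/* First two words of the indirect block, copied from the kgdb x/x output. */
static const uint32_t indir_dump[] = { 0x006941f0, 0x006944e0 };
```

    With these, indir_slot(13) is 1 and indir_dump[1] is 0x006944e0, the
    entry marked "this one" in the dump, while a two-block run starting at
    lbn 12 has end_lbn(12, 2) == 13 because the end is inclusive.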
    The DATA BLOCK looks properly connected:

    (kgdb) print bp->b_lblkno
    $18 = 13
          ^^^^^^ corresponds to the filesystem block 0x006944e0 above.

    (kgdb) printf "%08x\n", bp->b_bio.bio_blkno >> 1
    006944e0        (bio_blkno is in device blocks, i.e. 512 bytes, so it is
                    divided by 2 to get filesystem blocks).
    ^^^^^ Matches the data found in the indirect block (1K/8K)

    (kgdb) print $21->ufsmount_u.fs
    $25 = (struct fs *) 0xd1acb800
    (kgdb) print *$25
    $26 = {
      fs_firstfield = 0,
      fs_unused_1 = 0,
      fs_sblkno = 16,
      fs_cblkno = 24,
      fs_iblkno = 32,
      fs_dblkno = 1400,
      fs_cgoffset = 2048,
      fs_cgmask = -1,
      fs_time = 1129469246,
      fs_size = 37771928,
      fs_dsize = 36610722,
      fs_ncg = 839,
      fs_bsize = 8192,              <<<<<< 1K/8K blocks
      fs_fsize = 1024,
      fs_frag = 8,
      fs_minfree = 8,
      fs_rotdelay = 0,
      fs_rps = 60,
      fs_bmask = -8192,
      fs_fmask = -1024,
      fs_bshift = 13,
      fs_fshift = 10,

    But the contents of the data block are not a directory.  It looks
    like a piece of some other file:

    (kgdb) print bp
    $27 = (struct buf *) 0xc13ef8a0
    (kgdb) print bp->b_data
    $28 = 0xc341c000 "1_CA_CRT, &output);\n\n if (output & GNUTLS_CERT_INVALID)\n {\n fprintf (stderr, \"Not trusted\");\n\n if (output & GNUTLS_CERT_SIGNER_NOT_CA)\n\tfprintf (stderr, \": Issuer is not a CA\\n\");\n "...
    (kgdb)

    One of my users is reporting that multiple fscks are required to
    clean up the filesystem after the dirbad panic.  I haven't gotten the
    fsck output from him, but my guess is that there are duplicate blocks.

    David O'Brien has indicated that the problem occurs with softupdates
    turned on or off, so it isn't softupdates specifically.  So my guess
    is that there is something going on in UFS or the buffer cache.  I have
    a ton of bitmap sanity checks in DragonFly and none of them are being
    hit.  I have background bitmap writes turned off in DragonFly, so it
    has nothing to do with them.

    I am investigating a number of things but at the moment I am at a loss
    as to the cause.

					-Matt
					Matthew Dillon
					<dillon@backplane.com>
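
    For reference, the ">> 1" used on bio_blkno in the kgdb session can be
    written out as below.  This is a sketch only, assuming 512-byte device
    blocks as stated in the message and taking fs_fshift = 10 from the
    superblock dump.  dev_to_fsb() is a hypothetical standalone helper,
    not the kernel's own conversion macro.

```c
#include <assert.h>
#include <stdint.h>

#define DEV_BSHIFT 9  /* 512-byte device blocks, as stated in the message */

/*
 * Convert a device block number to a filesystem block address, given the
 * superblock's fs_fshift (log2 of fs_fsize).
 */
static int64_t dev_to_fsb(int64_t dev_blkno, int fs_fshift)
{
    assert(fs_fshift >= DEV_BSHIFT);
    return dev_blkno >> (fs_fshift - DEV_BSHIFT);
}
```

    With fs_fshift = 10 the shift is 1, so a device block number of
    0x00d289c0 maps to 0x006944e0, the value stored in the second slot of
    the indirect block.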