Date:      Fri, 17 Feb 2006 23:42:03 -0800 (PST)
From:      Matthew Dillon <dillon@apollo.backplane.com>
To:        Peter Jeremy <peterjeremy@optushome.com.au>
Cc:        David Rhodus <drhodus@machdep.com>, freebsd-current@freebsd.org
Subject:   Re: It still here... panic: ufs_dirbad: bad dir
Message-ID:  <200602180742.k1I7g3XA012241@apollo.backplane.com>
References:  <20060102222723.GA1754@dragon.NUXI.org> <200602180439.k1I4drNm010220@apollo.backplane.com> <20060218064523.GA684@turion.vk2pj.dyndns.org>


:
:On Fri, 2006-Feb-17 20:39:53 -0800, Matthew Dillon wrote:
:>    I'm running out of ideas.  Right now my best idea is that there is
:>    something broken in the code that writes out the modified 'rewound'
:>    blocks.  Perhaps an old version of a buffer, with old already-reused 
:>    block pointers, is being written out and then something happens to 
:>    prevent the latest version from being written out.  I don't know, I'm
:>    grasping at straws here.  If I could only reliably reproduce the bug
:>    I would write some code to record every I/O operation done on the
:>    raw device then track back to the write that created the corruption.
:
:Is it worth setting up a ring buffer that just stores the last few
:thousand I/O requests and waiting for someone to trip over the panic?
:This should work if the corruption is close (in temporal terms) to
:the panic.
:
:-- 
:Peter Jeremy
    
    Only if the problem can be reproduced reliably.  The actual corruption
    is likely occurring on the order of tens of thousands or even millions
    of I/Os prior to the actual panic.  Valid 'clean' cached data is also
    probably hiding the corrupted on-disk blocks for a long period of time.
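
    A minimal sketch of why a small ring buffer would miss the culprit,
    assuming a toy trace where each request is just (sequence, op, block).
    The names IOTrace and simulate are hypothetical and are not part of
    any kernel; this only illustrates the capacity-versus-distance problem.

```python
from collections import deque


class IOTrace:
    """Fixed-size trace of the most recent I/O requests (hypothetical helper)."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest entry once it is full.
        self.ring = deque(maxlen=capacity)
        self.seq = 0  # total number of requests ever recorded

    def record(self, op, blkno):
        self.seq += 1
        self.ring.append((self.seq, op, blkno))

    def writes_to(self, blkno):
        """Return the recorded writes to blkno that are still in the ring."""
        return [e for e in self.ring if e[1] == "write" and e[2] == blkno]


def simulate(capacity, gap):
    """Record one bad write to block 1234, then `gap` unrelated requests."""
    trace = IOTrace(capacity)
    trace.record("write", 1234)  # the write that corrupts the block
    for i in range(gap):
        # Unrelated traffic between the corruption and the panic.
        trace.record("read", 5000 + i % 100)
    return trace.writes_to(1234)
```

    With a 4096-entry ring, simulate(4096, 1000) still finds the bad
    write, but simulate(4096, 50000) returns an empty list because the
    entry was overwritten long before the panic.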

    The corruption is either:

    (1) The contents of a directory block are corrupted.
    (2) An indirect block pointer in the inode is corrupted.
    or
    (3) The block pointers in the indirect block itself are corrupted.

    I have NEVER seen any corruption of a direct block for a directory.
    Not once.  I also have never seen any bitmap corruption (and I added
    a lot of sanity checks in the bitmap code).  The corruption I have
    seen tends to be data starting at the beginning of a block pointed to
    by an indirect block.

    This implies that either the indirect block pointer as stored in the
    inode is wrong, that the block pointers in the indirect block are wrong,
    or that the data in those blocks is wrong.  If the leaf data were
    corrupted I would have expected the leaf data in the inode's direct
    blocks to be corrupted too, but I've never seen that. 

    The block arrays in the indirect blocks that I have examined have always
    looked like real block arrays, i.e. one or two entries and then all
    zeros (due to the limited size of the directory).
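
    The "looks like a real block array" test can be written down as a
    rough heuristic.  This is a sketch under stated assumptions: the
    pointers are little-endian signed integers of 4 or 8 bytes, and
    "a few" means at most max_used entries.  Both function names are
    hypothetical.

```python
import struct


def parse_block_pointers(raw, ptr_size=4):
    """Decode a raw indirect block into a list of block numbers.

    Assumes little-endian signed pointers of ptr_size bytes (4 or 8).
    """
    count = len(raw) // ptr_size
    fmt = "<%d%s" % (count, "i" if ptr_size == 4 else "q")
    return list(struct.unpack(fmt, raw[: count * ptr_size]))


def looks_like_small_dir_array(ptrs, max_used=4):
    """True if ptrs is a few non-zero entries followed only by zeros."""
    used = 0
    while used < len(ptrs) and ptrs[used] != 0:
        used += 1
    # Every slot after the leading run of pointers must be zero.
    tail_is_zero = all(p == 0 for p in ptrs[used:])
    return 0 < used <= max_used and tail_is_zero
```

    An array such as [100, 200, 0, 0] passes, while [100, 0, 7, 0] or
    an all-zero array does not.  Passing this check says nothing about
    whether the pointers are current, only that the array is not junk.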

    Perhaps softupdates is replacing the block array in an indirect block
    with an 'old' version due to indirect block dependencies and then never
    restoring it properly, somehow losing track of the mess.  I just don't
    know.  The corrupted data in the directories always looked like complete
    junk... part of some other file and not directory entries at all.  

    The only scenario that I can think of is that softupdates is storing
    'old' block pointers in the indirect block then losing track of the
    dependenc(ies) and never writing out the latest 'correct' block pointers.
    The old block pointers get reallocated to another purpose, such as a
    file, the 'clean' buffer containing the correct block pointers gets
    thrown away at some later point in time, and then the next access
    of that directory goes to disk, retrieves the 'bad' indirect block and
    related data, and panics.
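
    The sequence above can be walked through with a toy model.  This is
    only an illustration of the hypothesis and is not softupdates code:
    ToyDisk, read_directory, run_scenario and the block numbers are all
    made up, and the "disk" and "cache" are plain dictionaries.

```python
class ToyDisk:
    """Toy model: block number -> contents, plus a cache of clean buffers."""

    def __init__(self):
        self.disk = {}
        self.cache = {}

    def read(self, blkno):
        # A cached buffer hides whatever is actually on disk.
        if blkno in self.cache:
            return self.cache[blkno]
        return self.disk[blkno]

    def evict(self, blkno):
        # Clean buffers are discarded without being written back.
        self.cache.pop(blkno, None)


def read_directory(d, indirect_blkno):
    """Follow the indirect block and return the contents of each data block."""
    return [d.read(p) for p in d.read(indirect_blkno) if p != 0]


def run_scenario():
    d = ToyDisk()
    INDIRECT, OLD, NEW = 10, 50, 80    # hypothetical block numbers
    d.disk[NEW] = "dirents"            # current directory data block
    d.disk[INDIRECT] = [OLD, 0]        # stale pointers were written out
    d.cache[INDIRECT] = [NEW, 0]       # correct pointers live only in memory
    d.disk[OLD] = "file data"          # old block reallocated to a file
    before = read_directory(d, INDIRECT)   # served from the clean buffer
    d.evict(INDIRECT)                      # buffer thrown away, no writeback
    after = read_directory(d, INDIRECT)    # served from disk
    return before, after
```

    run_scenario() returns (["dirents"], ["file data"]): the directory
    reads correctly while the clean buffer is cached, and returns part
    of some other file once the buffer is gone.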

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>


