Date: Fri, 17 Feb 2006 23:42:03 -0800 (PST)
From: Matthew Dillon
Message-Id: <200602180742.k1I7g3XA012241@apollo.backplane.com>
To: Peter Jeremy
References: <20060102222723.GA1754@dragon.NUXI.org> <200602180439.k1I4drNm010220@apollo.backplane.com> <20060218064523.GA684@turion.vk2pj.dyndns.org>
Cc: David Rhodus, freebsd-current@freebsd.org
Subject: Re: It still here... panic: ufs_dirbad: bad dir
List-Id: Discussions about the use of FreeBSD-current

:
:On Fri, 2006-Feb-17 20:39:53 -0800, Matthew Dillon wrote:
:> I'm running out of ideas. Right now my best idea is that there is
:> something broken in the code that writes out the modified 'rewound'
:> blocks. Perhaps an old version of a buffer, with old already-reused
:> block pointers, is being written out and then something happens to
:> prevent the latest version from being written out. I don't know, I'm
:> grasping at straws here.
:> If I could only reliably reproduce the bug
:> I would write some code to record every I/O operation done on the
:> raw device then track back to the write that created the corruption.
:
:Is it worth setting up a ring buffer that just stores the last few
:thousand I/O requests and waiting for someone to trip over the panic?
:This should work if the corruption is close (in temporal terms) to
:the panic.
:
:--
:Peter Jeremy

Only if the problem can be reproduced reliably. The actual corruption
is likely occurring on the order of tens of thousands or even millions
of I/Os prior to the actual panic. Valid 'clean' cached data is also
probably hiding the corrupted on-disk blocks for a long period of time.

The corruption is either:

(1) The contents of a directory block are corrupted.

(2) An indirect block pointer in the inode is corrupted.

or (3) The block pointers in the indirect block itself become corrupted.

I have NEVER seen any corruption of a direct block for a directory.
Not once. I also have never seen any bitmap corruption (and I added
a lot of sanity checks in the bitmap code).

The corruption I have seen tends to be data starting at the beginning
of a block pointed to by an indirect block. This implies that either
the indirect block pointer as stored in the inode is wrong, that the
block pointers in the indirect block are wrong, or that the data in
those blocks is wrong. If the leaf data were being corrupted I would
have expected the leaf data in the inode's direct blocks to be
corrupted too, but I've never seen that.

The block arrays in the indirect blocks that I have examined have
always looked like real block arrays, i.e. one or two entries and then
all zeros (due to the limited size of the directory). Perhaps
softupdates is replacing the block array in an indirect block with an
'old' version due to indirect block dependencies and then never
restoring it properly, somehow losing track of the mess. I just don't
know.
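To make the ring buffer objection concrete, here is a minimal user-space sketch in Python (chosen for brevity; it is not kernel code). The names `IoRecord`, `IoTrace` and `writes_touching`, and the block numbers, are hypothetical and only illustrate the idea of keeping the last N I/O requests.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class IoRecord:
    seq: int       # monotonically increasing request number
    op: str        # "read" or "write"
    block: int     # first device block of the request (illustrative field)
    nblocks: int   # length of the request in blocks


class IoTrace:
    """Fixed-size ring buffer holding only the most recent I/O requests."""

    def __init__(self, capacity):
        # deque(maxlen=...) silently drops the oldest entry when full.
        self._ring = deque(maxlen=capacity)
        self._seq = 0

    def record(self, op, block, nblocks=1):
        self._seq += 1
        self._ring.append(IoRecord(self._seq, op, block, nblocks))

    def writes_touching(self, block):
        """Return the retained writes that cover `block`, oldest first."""
        return [r for r in self._ring
                if r.op == "write" and r.block <= block < r.block + r.nblocks]


def demo():
    trace = IoTrace(capacity=1000)
    trace.record("write", block=4242)         # the write that plants bad data
    for i in range(50_000):                   # unrelated traffic before the panic
        trace.record("write", block=100_000 + i)
    return trace.writes_touching(4242)        # culprit has already been evicted
```

With a capacity of a few thousand entries and tens of thousands of later requests, `demo()` finds nothing: the offending write has aged out of the buffer long before anyone looks for it.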
The corrupted data in the directories always looked like complete
junk... part of some other file and not directory entries at all.

The only scenario that I can think of is that softupdates is storing
'old' block pointers in the indirect block, then losing track of the
dependency (or dependencies) and never writing out the latest 'correct'
block pointers. The old block pointers get reallocated to another
purpose, such as a file, the 'clean' buffer containing the correct
block pointers gets thrown away at some later point in time, and then
the next access of that directory goes to disk, retrieves the 'bad'
indirect block and related data, and panics.

-Matt
Matthew Dillon
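The suspected sequence of events can be walked through with a toy model. This is a sketch under heavy assumptions, not a description of the real softupdates code: the disk and buffer cache are plain dictionaries, the block numbers are made up, and the "bug" is modeled simply as the newer pointer array never reaching the disk.

```python
def run_scenario():
    disk = {}    # block number -> on-disk contents
    cache = {}   # block number -> cached (in-memory) contents

    INDIRECT = 50                        # hypothetical indirect block number
    OLD_DIR_BLK, NEW_DIR_BLK = 700, 900  # hypothetical directory data blocks

    # The on-disk indirect block holds an 'old' pointer array.
    disk[OLD_DIR_BLK] = "dirents-v1"
    disk[INDIRECT] = [OLD_DIR_BLK]

    # The latest, correct pointer array exists only in the buffer cache.
    disk[NEW_DIR_BLK] = "dirents-v2"
    cache[INDIRECT] = [NEW_DIR_BLK]

    # Assumed bug: the dependency is lost, so disk[INDIRECT] is never
    # updated, yet the cached buffer is treated as clean.

    # The old block is free, so it is reallocated to a regular file.
    disk[OLD_DIR_BLK] = "unrelated file data"

    def read_dir():
        ptrs = cache.get(INDIRECT)
        if ptrs is None:                 # cache miss: go to disk
            ptrs = disk[INDIRECT]
            cache[INDIRECT] = ptrs
        return [disk[b] for b in ptrs]

    before = read_dir()                  # served from the correct cached copy
    del cache[INDIRECT]                  # 'clean' buffer is thrown away
    after = read_dir()                   # stale pointers now lead to junk
    return before, after
```

In this model the directory reads correctly for as long as the cached indirect block survives, which matches the observation that clean cached data can hide the on-disk damage for a long time. Only the first read after the eviction returns another file's data in place of directory entries.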