From owner-freebsd-current Sat Sep 19 10:45:44 1998
Return-Path:
Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id KAA25389 for freebsd-current-outgoing; Sat, 19 Sep 1998 10:45:44 -0700 (PDT) (envelope-from owner-freebsd-current@FreeBSD.ORG)
Received: from smtp02.primenet.com (smtp02.primenet.com [206.165.6.132]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id KAA25384 for ; Sat, 19 Sep 1998 10:45:43 -0700 (PDT) (envelope-from tlambert@usr09.primenet.com)
Received: (from daemon@localhost) by smtp02.primenet.com (8.8.8/8.8.8) id KAA09878; Sat, 19 Sep 1998 10:45:17 -0700 (MST)
Received: from usr09.primenet.com(206.165.6.209) via SMTP by smtp02.primenet.com, id smtpd009830; Sat Sep 19 10:45:08 1998
Received: (from tlambert@localhost) by usr09.primenet.com (8.8.5/8.8.5) id KAA10912; Sat, 19 Sep 1998 10:45:03 -0700 (MST)
From: Terry Lambert
Message-Id: <199809191745.KAA10912@usr09.primenet.com>
Subject: Re: panic: newdirrem inum 48733 should be 48732 (SMP+SOFTUPDATES)
To: Don.Lewis@tsc.tdk.com (Don Lewis)
Date: Sat, 19 Sep 1998 17:45:03 +0000 (GMT)
Cc: tlambert@primenet.com, ken@plutotech.com, andreas@klemm.gtn.com, current@FreeBSD.ORG
In-Reply-To: <199809190930.CAA10502@salsa.gv.tsc.tdk.com> from "Don Lewis" at Sep 19, 98 02:30:57 am
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-current@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

> I don't think this problem is CAM.

Are you running SCSI, or are you running IDE?

> In the case of the initiate_write_filepage
> panic, the softupdates code is doing some sanity checking on a buf before
> writing it to disk.  In this case, some modifications have been made to
> the contents of the buf, but they can't be committed to disk because of
> other dependencies.
> For one reason or another the code decides it
> really needs to write this buf to disk, so it has to undo the modifications
> so that the disk contents will be in a consistent state.  While the code is
> going through the list of the modifications and undoing them (with the
> buf locked), it notices that what is currently stored in the buf is not
> what it thinks was last put there, so it calls panic().  If the undo
> had succeeded, the buf would have been written to disk, then the changes
> would have been redone and the buf unlocked.

I believe the patch addresses this dependency miss.  This is the way
it's supposed to work, BTW.

> } It is *imperative* to the soft updates technology that async writes
> } occur in the order they are requested to occur; the main provision
> } of soft updates is to not advance the soft clock "wheel" until the
> } previous wheel slot writes have been committed, such that all
> } physical writes occur in dependency order.
>
> This is true, but it only makes a difference if there is a crash or
> the hardware gets suddenly turned off in the middle.  If things are
> stored to disk in the wrong order and you don't get everything stored
> to disk, then the filesystem will be in an inconsistent state when the
> system boots, and you may be better off using newfs than fsck.  If you
> could guarantee that the system would stay up until it went into a
> quiet state where you could flush everything to disk, it wouldn't matter
> what order things were written before that point.

Right; but you can't guarantee that.  As a result, the flush-to-disk
has to be in dependency order.

The entire point of backing out buffer modifications with the buffer
locked preparatory to a write is to ensure that things occur *in order*
for disk blocks that can contain multiple objects.  This means for
multiple directory entry changes in a directory block, or for multiple
inode changes in a block containing (4) inodes, etc.
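To make the back-out-then-redo sequence concrete, here is a minimal
sketch in Python.  It is a toy model, not the soft updates code: the
names Change, Buf, and write_block are hypothetical, a "block" is just
a list of slots, and it assumes at most one outstanding change per slot.
It only illustrates the idea that unsatisfied changes are hidden from
the disk image for the duration of the write, and that a mismatch found
during the undo is treated as fatal.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    """One in-memory modification to a slot of a block (hypothetical record)."""
    slot: int
    old: int          # value that is safe to have on disk
    new: int          # value currently held in memory
    satisfied: bool   # True once everything this change depends on is on disk


@dataclass
class Buf:
    """Toy buffer: a list of slots plus the changes applied to it."""
    data: list
    changes: list = field(default_factory=list)

    def modify(self, slot, new, satisfied=False):
        self.changes.append(Change(slot, self.data[slot], new, satisfied))
        self.data[slot] = new


def write_block(buf, disk):
    """Write buf to disk while hiding changes whose dependencies are unmet."""
    pending = [c for c in buf.changes if not c.satisfied]

    # Undo phase (a real kernel would hold the buf locked here).
    for c in reversed(pending):
        if buf.data[c.slot] != c.new:
            # Stand-in for the sanity check that panics in this thread.
            raise RuntimeError(
                f"slot {c.slot}: found {buf.data[c.slot]}, should be {c.new}")
        buf.data[c.slot] = c.old

    disk[:] = buf.data            # the "physical" write of the safe image

    # Redo phase: put the in-memory state back the way it was.
    for c in pending:
        buf.data[c.slot] = c.new
    buf.changes = pending         # satisfied changes are now on disk
```

A caller would build a Buf, apply modify() calls, and pass a list
standing in for the on-disk block to write_block(); the disk image then
contains only the satisfied changes, while the buffer keeps all of them.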
The intent of the sync clock is to get things in dependency order,
in lockstep, to the disk.

If this is done, then the image on the disk can *not* have any
inconsistency *other* than an incorrect cylinder group bitmap.

Ever.

Period.

End of sentence.

This was tested in the following test jig:

1)	A machine with a serial console for debugging.

2)	Another machine with a serial connection to allow breaking
	into the debugger on the first.

3)	The SCSI host adapter on a target other than the default.

4)	The SCSI busses chained together between the machines.

5)	At random intervals, during heavy FS activity, the machine
	engaged in the activity is broken into the debugger,
	effectively halting it.

6)	The second machine then fsck's the first machine's disk out
	from under it, reporting anything other than cylinder group
	bitmap inconsistencies and the clean flag in the superblock
	as fatal errors.

7)	If no fatal errors occur, the first machine is exited from
	the debugger (effectively causing it to pick up where it
	left off).

8)	If wall time < 1 week, go to 5.

Probably, Julian or Kirk need to test this vs. a CAM kernel to find
out where CAM is screwing up the ordering.

PS: Now you know why I believe the problem is in the CAM integration;
the soft updates code is not your average "it worked on my box" type
of FreeBSD test-three-times-then-commit.

PPS: My personal suspicion is that it's in the tagged command queue,
wherein ordering is not enforced between commands before
acknowledgement; hence my prior suggestion that a "drain tagged
command queue" entrypoint called from the syncer might be the
easiest way to fix the problem...

					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
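As a rough illustration of the PPS above, the following Python sketch
models a device whose tagged queue may complete outstanding writes in
any order, and shows how a drain between syncer wheel slots restores
dependency order.  Everything here is an assumption made for the sake
of the example: TaggedQueueDisk, flush, and in_dependency_order are
hypothetical names, not CAM interfaces, and the device is modelled in
the worst case as completing whatever is queued in reverse order.

```python
class TaggedQueueDisk:
    """Toy device: queued writes are not ordered until the queue is drained."""

    def __init__(self):
        self.queue = []       # accepted by the device, not yet on the platter
        self.committed = []   # order in which writes actually landed

    def submit(self, block):
        self.queue.append(block)

    def drain(self):
        """Barrier: complete everything queued so far before returning."""
        # Worst-case assumption: the device picks the reverse order.
        self.committed.extend(reversed(self.queue))
        self.queue.clear()


def flush(disk, slots, drain_between_slots):
    """Push each wheel slot (a list of blocks) to the device in slot order."""
    for slot in slots:
        for block in slot:
            disk.submit(block)
        if drain_between_slots:
            disk.drain()      # the hypothetical "drain tagged queue" call
    disk.drain()              # final flush of anything still queued
    return disk.committed


def in_dependency_order(committed, slots):
    """True if every block of slot i landed before any block of slot i+1."""
    rank = {block: i for i, slot in enumerate(slots) for block in slot}
    ranks = [rank[block] for block in committed]
    return ranks == sorted(ranks)
```

In this model, calling flush() with drain_between_slots=False lets a
later slot's block land before an earlier slot's, while passing True
keeps the slots in order even though blocks within one slot may still
be reordered.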