Date: Tue, 12 Feb 2002 12:09:35 -0800 (PST) From: Matthew Dillon <dillon@apollo.backplane.com> To: Bruce Evans <bde@zeta.org.au> Cc: Matteo <drummino@yahoo.com>, <freebsd-stable@FreeBSD.ORG>, <freebsd-bugs@FreeBSD.ORG>, <aiuto@gufi.org> Subject: Re: Crash System with MSDOS file-system driver! Message-ID: <200202122009.g1CK9ZW64977@apollo.backplane.com> References: <20020212225510.F4096-100000@gamplex.bde.org>
next in thread | previous in thread | raw e-mail | index | archive | help
: :On Mon, 11 Feb 2002, Matthew Dillon wrote: : :> Well, I think you did find a bug in VFS/BIO. Even UFS will unnecessarily :> break if a fully valid buffer is written and the write fails, due to :> b_resid being left non-zero (B_ERROR is cleared in that case and the :> write is re-queued). : :I think that failure is fairly safe. The read will just terminate early :for the bad block. Oh no, definitely not safe. It's similar to the problem we had when write-failures caused the dirty block to be removed from the buffer cache. You wind up with an inconsistent filesystem - for example, with inconsistent meta-data, as if the system just decided to randomly drop a dirty buffer. This can result in unrecoverable filesystem corruption. So instead of having a simple write error and a failure on that directory or file, one winds up with a write error plus additional unrelated filesystem corruption. Bitmap and inode blocks are especially vulnerable. :> I also don't think it's a good idea to just munge breadn() and leave :> bread() alone. : :bread() just calls breadn(). As a general case, the situation should apply to anyone locking a buffer, thus the placement in getblk(). If it weren't for NFS we could do it across the board. In anycase, once I get the rest of the MFCs done I'll do a quick workup and start testing a fix. :It is just a sanity check. Device drivers return a short read for EOF at :least. The EOF case shouldn't happen because filesystems are built of :whole blocks. Except I forgot the nfs case... I knew that short reads :were non-errors for block devices and thought that removing block devices :made them errors in all cases. Remember, the buffer cache is a logical block cache, not a physical block cache (though it does that too). Blocks are primarily relative to their logical files. NFS has no concept of physical backing store, and even UFS differentiates between a fragment-backed block and full-block-backed block. The UFS strategy code actually figures out the correct block size and so does not depend on b_resid to detect a short read. NFS does not have this luxury in the case of a directory or softlink but hacks around the directory problem by keeping track of the directory's ending offset. Physio depends on b_resid to detect an EOF condition. :> The API, in general, does not explicitly :> disallow EOF occuring in the first buffer so unless you believe an API :> change is in order and clearly document it in breadn()'s comments, you :> should not commit this patch. (For example, NFS doesn't use breadn() :> but if it did it would likely assume that EOF could occur anywhere, even :> in the first buffer). : :It's not a good API, so some changes to it wouldn't hurt. Currently, :callers of bread() are apparently required to detect short reads by :comparing bp->b_bcount - bp->b_resid with their requested size. Most :callers don't bother. physio() is the main exception. : :Bruce Short reads can only be implemented by the underlying physio. Filesystems such as UFS and MSDOSFS know how large the underlying physical device is (and for files know how large the underlying file is) and so simply ensure that they do not attempt to read past the underlying device's EOF or past the file EOF. This is why they can make the assumption that a short read will not occur. A read error will return a real B_ERROR. NFS depends on short reads for directory and symlink operations and also depends on short reads for regular file reads if the server chooses to optimize a block of zeros. Fortunately NFS only assumes persistent state for (non-VMIO) symlink reads. Physio depends on short reads for EOF detection but since buffers are not cached it is a transitory indicator (and, in anycase, also non-VMIO). The buffer cache API supports short reads. It is not a bug or a mistake. -Matt Matthew Dillon <dillon@backplane.com> To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-stable" in the body of the message
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?200202122009.g1CK9ZW64977>