Date: Thu, 24 Aug 2006 00:05:01 +1000 (EST) From: Bruce Evans <bde@zeta.org.au> To: Tor Egge <Tor.Egge@cvsup.no.freebsd.org> Cc: freebsd-fs@freebsd.org Subject: Re: Deadlock between nfsd and snapshots. Message-ID: <20060823203148.M62850@delplex.bde.org> In-Reply-To: <20060822.214638.74697110.Tor.Egge@cvsup.no.freebsd.org> References: <20060821.132151.41668008.Tor.Egge@cvsup.no.freebsd.org> <20060822175540.V58720@delplex.bde.org> <20060822130743.GL56637@deviant.kiev.zoral.com.ua> <20060822.214638.74697110.Tor.Egge@cvsup.no.freebsd.org>
next in thread | previous in thread | raw e-mail | index | archive | help
On Tue, 22 Aug 2006, Tor Egge wrote: >> I have a proposal. Sorry, I don't like it. >> 1. Remove IN_ACCESS, IN_UPDATE, IN_CHANGE from i_flag. For each flag, >> introduce two new i_ fields, e.g., i_access of type timespec, and >> i_accessed of boolean type. > > On amd64, sizeof(struct timespec) is 16 bytes and sizeof(struct boolean_t) is 4 > bytes. 3 * (16 + 4) = 60 bytes extra per inode. With 100K inodes that becomes > 6 MB extra memory. > > I don't see why all these extra fields are needed. I think i_access is to avoid changing the dinode, but changes to the dinode are best avoided by just not changing it (as you give details on below). IN_ACCESS is not in the dinode so it shouldn't need a new field. >> 2. All places that currently set IN_ACCESS, instead would increment >> i_accessed using the atomic ops. ufs_itimes shall update i_access under some M >> mutex if i_accessed is greater than zero. > > Protecting the existing i_flag and the timestamps with the vnode interlock when > the current thread only has a shared vnode lock should be sufficient to protect > against the races, removing the need for #3, #4 and #4 below. I agree that this should be sufficient. Don't know if it is. Actually, I thought that the vnode lock was more exclusive. How can a shared lock work even for a reader if a writer is changing inode contents? > What's left is avoiding setting IN_MODIFIED when it's unsafe, to protect > against the deadlock. > >> 3. Check the i_access instead of the IN_ACCESS. > >> 4. ffs_update and ffs_syncvnode shall do the DIP_SET(i_atime) under the mutex >> from #2 before the main run and set IN_MODIFIED accordingly if i_accessed is >> not 0. >From a followup: % ffs_update shall be excluded, only ffs_syncvnode left in the list. % ffs_syncvnode is enclosed in the vn_start_write braces. >> 4. ufs_getattr shall retrieve the *time from new i_ fields under the mutex >> from #2 if corresponding i_ flag is set. > >> Basically, I want to set IN_MODIFIED i_flag (induced by IN_ACCESS and others) >> only under exclusive vnode lock. Moreover, i_accessed can be zeroed only >> under exclusive lock. This way, even shared lock on the vnode shall be enough >> to safely update modification times, and the times are moved to the disk >> often enough (at least, at the sync of the syncer vnodes). > > An exclusive vnode lock isn't needed, see above. Holding an exclusive vnode > lock does not make it safe to set IN_MODIFIED. Locking is complicated enough even if you can actually lock things :-(. > There are some constraints with regards to setting IN_MODIFIED on an inode. > > If neither IN_CHANGE nor IN_UPDATE is set then it might be unsafe to set > IN_MODIFIED since the file system might be suspended or in the process of being > suspended with the vnode sync loop in ffs_sync() having iterated past the > vnode. The case can't happen. IN_CHANGE is always set if IN_MODIFIED will (or should be) set later. There are some buggy cases where the combined setting of { IN_CHANGE, IN_MODIFIED } is incorrect, but these don't cause any problems here. We just have to avoid setting (or having to set) IN_MODIFED when setting it is not safe. Hopefully this is only in suspend mode. > If the file system is suspended then IN_MODIFIED cannot be set. If IN_MODIFIED, > IN_CHANGE or IN_UPDATE is set and the file system is suspended then something > is wrong. I think ufs_itimes() needs to use a non-blocking vn_start_write() and do nothing (except perhaps assert that the above harmful IN_* flags are not set) if it (ufs_itimes()) would set IN_MODIFIED. > If the file system is in the process of being suspended then IN_MODIFIED can be > set at the cost of triggering a restart of the vnode sync loop in ffs_sync(). Yes, once IN_MODIFIED is set, it is up to the sync loop or somewhere near it to ensure that suspend mode is not entered while an IN_MODIFIED flag is set for _any_ inode, irrespective of why or how IN_MODIFIED was set (whatever set it should be free to call vn_finish_write()). > If either IN_MODIFIED, IN_CHANGE or IN_UPDATE is already set then the vnode > sync loop has not reached the vnode, and a restart isn't needed. > > When ufs_itimes() cannot set IN_MODIFIED then it has to either risk losing the > access time update or use some mechanism to defer it (e.g. set IN_LAZYMOD or > a new flag and let process_deferred_inactive() set IN_MODIFIED after the file > system has been resumed). Yes, IN_LAZYMOD can be used to reduce the problem, and not much would be lost by throwing away atime changes that occur during a suspend if there are too many to convert to IN_LAZYMOD. Note that IN_LAZYMOD is completely (?) unused in -current: - it us not used for soft updates for historical reasons (initially mostly FUD) - it was only used for special files, but now devfs is used for special files. I have used IN_LAZYMOD for atime-only updates of all file types in ufs_itimes() for several years, but haven't tested it much since I mount almost all file systems with -noatime. The patch is simple. IN_LAZYMOD still gets turned into IN_MODIFIED in ufs_reclaim() so using it only for atime-only updates wouldn't completely fix the problem. It's hard for ufs_reclaim() to do anything except discard IN_LAZYMOD updates now. Summary of my understanding of this problem: - no problem with normal "writes" (at least modulo the sync loop checking the correct flags and/or bugs related to missing settings of IN_MODIFED). We wait until writes complete before entrering suspend mode, and don't allow writes in while in suspend mode. - problem for atime updates caused by reads. These become writes while in suspend mode. We want reads to work in suspend mode, so we cannout disallow reads, and we cannot disallow the implicit writes without breaking atime semantics. This problem can be partly avoided by ignoring IN_ACCESS in ufs_itimes() while in suspend mode, or by converting it to IN_LAZYMOD. If the inode gets reclaimed then we lose the atime update; otherwise the atime gets updated some time after we leave suspend mode, and either way the update doesn't go to snapshots, as is necessary for having a coherent snapshot. - problem for IN_LAZYMOD in ufs_reclaim(). Currently not reached. A quick fix would be to lose whatever updates are being done lazily. - problem syncing atimes while entering suspend mode. For writes, hopefully we get a consistent snapshot by disallowing new writes while entering sync mode (not just when it is entered). This doesn't work for reads. Bruce
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20060823203148.M62850>