From owner-freebsd-arch@FreeBSD.ORG Thu Sep 1 13:47:34 2005 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id DC9FB16A41F for ; Thu, 1 Sep 2005 13:47:34 +0000 (GMT) (envelope-from bde@zeta.org.au) Received: from mailout1.pacific.net.au (mailout1.pacific.net.au [61.8.0.84]) by mx1.FreeBSD.org (Postfix) with ESMTP id 4766D43D45 for ; Thu, 1 Sep 2005 13:47:34 +0000 (GMT) (envelope-from bde@zeta.org.au) Received: from mailproxy1.pacific.net.au (mailproxy1.pacific.net.au [61.8.0.86]) by mailout1.pacific.net.au (8.13.4/8.13.4/Debian-3) with ESMTP id j81DlIVc021240; Thu, 1 Sep 2005 23:47:18 +1000 Received: from katana.zip.com.au (katana.zip.com.au [61.8.7.246]) by mailproxy1.pacific.net.au (8.13.4/8.13.4/Debian-3) with ESMTP id j81DlFfX002872; Thu, 1 Sep 2005 23:47:17 +1000 Date: Thu, 1 Sep 2005 23:47:15 +1000 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: Dmitry Pryanishnikov In-Reply-To: <20050901113819.F95708@atlantis.atlantis.dp.ua> Message-ID: <20050901201602.X99455@delplex.bde.org> References: <20050901113819.F95708@atlantis.atlantis.dp.ua> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Re: kern/85503: panic: wrong dirclust using msdosfs in RELENG_6 X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 01 Sep 2005 13:47:35 -0000 On Thu, 1 Sep 2005, Dmitry Pryanishnikov wrote: > During the hunting the bug kern/85503 (panic: wrong dirclust in msdosfs) > I've tried to think about the solution, but it seems to be > architecture-related. The problem is: msdosfs uses pseudo-inodes (that is, > the offset from the start of the partition to the start of directory entry > in bytes) which must therefore have off_t bitness (at least 64 bits). I've > found the primary error (lack of casts leaded to 32-bit result), but then > we should transfer this 64-bit "inode" number to vfs_hash_get(). Oops, > it also limited to u_int (32 bits on i386). Finally, I see that the > primary shortcoming here: in sys/vnode.h we have > > /* > * vfs_hash: (mount + inode) -> vnode hash. > */ > LIST_ENTRY(vnode) v_hashlist; > u_int v_hash; > > I think it's feasible and useful to upgrade type of v_hash to at least off_t. > In our days of large media we will face filesystems with more than 4 billions > files (and thus at least 64-bit inode numbers) quite often. Am I right? This is not needed yet. Filesystems with more than 4G files are not supported yet, since ino_t is 32 bits and is used in critical APIs (struct stat...). Also, d_fileno in struct dirent is 32 bits but not identical to ino_t (POSIX with XSI extensions requires it to be identical). Even when ino_t is enlarged, backwards- and sideways-compat syscalls must do something with unrepresentable inode numbers. It's best for filesystems to map large inode numbers to small ones so that unrepresentable inodes are never seen by upper layers. So all current file systems need to generate unique 32-bit inode numbers. This may be difficult, but once it is done I think the inode number can be used as a key to pass to the hash functions. (The key is bogusly named "hash" in the hash function args and in v_hash above.) For msdosfs, the inode number is essentially the byte offset divided by the size of a directory entry. The size is 32, so this breaks at a byte offset of 128G instead of 4G. Details: - the inode number of the root dir is 1 - the inode number of a non-root dir is the (byte offset of the dir as a file) / 32 - the inode number of a non-dir is the (byte offset of the dir entry for the file) / 32. For stat() and possibly for readdir() and hashget(), I think it would work to use the cluster offset of the file for the fake inode number. The cluster size is always much larger than 32, so this would work for much larger file systems. cd9660 has a related bug suite. Its inode numbers are byte offsets of directory entries, so they break at 4G as in msdosfs. In addition, cd9660 needs to support hard links, but a 1-1 mapping from directory entries to hard links is perfectly wrong. This bug was worked around for stat() and readdir() in rev.1.77 of cd9660_vnops.c, by using the block offset of the file for the fake inode number. cd9660_ihashget() remained broken for directories above 4G and for hard links (having multiple active vnodes for the same file can't be good but might not be too bad for a read-only file system). However, rev.1.77 was temporarily backed out in rev.1.98 because it broke cd9660_fhtovp() and/or nfs readdirplus(). File handles in cd9660 contain an inode number and cd9660_fhtovp() depends on this number being compatible with the one set by readdir(). This part of the bug suite is smaller in msdosfs -- there file handles are directory (cluster, offset) pairs so enough info is available to derive whatever number vget() wants. The log message for rev.1.77 says that cd9660 did things better in FreeBSD-1 where it used directory (cluster, offset) pairs more. It seems to have never used these for file handles, so I now think that it needs to use them in some places where it now abuses inodes to look up directory entries, but that cd9660_fhtovp() is not one of these places (or hard to fix). cd9660_fhtovp() only needs to look up vnodes and a unique inode number is enough for that; the (cluster, offset) pair in msdosfs's file handles is just to match deget()'s interface. Try always using the cluster offset of the file for the fake inode number in msdosfs. cd9660 is harder to fix since it has hard links and symlinks. IIRC, its "inode" numbers are needed only to look up directory entries for symlinks. Using the directory entry for the inode in cd9660_ihashget() is bad for all types of links. Bruce