Date: Fri, 2 Sep 2005 21:39:06 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: Dmitry Pryanishnikov <dmitry@atlantis.dp.ua>
Cc: freebsd-arch@FreeBSD.org
Subject: Re: kern/85503: panic: wrong dirclust using msdosfs in RELENG_6
Message-ID: <20050902205456.S2885@delplex.bde.org>
In-Reply-To: <20050901183311.D62325@atlantis.atlantis.dp.ua>
References: <20050901183311.D62325@atlantis.atlantis.dp.ua>
On Thu, 1 Sep 2005, Dmitry Pryanishnikov wrote:

>>> I think it's feasible and useful to upgrade the type of v_hash to at
>>> least off_t.
>>
>> This is not needed yet.
>>
>> Filesystems with more than 4G files are not supported yet, since ino_t
>> is 32 bits and is used in critical APIs (struct stat...). Also,
>
> Sorry, I don't agree with you. The current situation is ugly: not only
> does it force us to play dirty tricks within filesystems in order to
> generate unique 32-bit inode numbers, but it also creates an artificial
> limit

If you want to fix this, first work on the much larger problems of
enlarging ino_t and changing the not-unused ffs file system to support
more than 4G files. Note that this was considered too hard to do for
ffs2.

Tricks to map to the API's inode number space are unavoidable due to the
existence of compatibility APIs, and they belong in individual file
systems since they are too hard to do generally. General code could only
hash from a larger v_hash type to a smaller compat_subsystem_ino_t type
and then somehow make the hash unique. It is only necessary for the
result to be unique for files actually returned in the smaller ino_t's
since boot time (or since mount time, for a poor implementation that
doesn't work as well as possible for at least nfs servers), but even this
seems to require storing up to SMALLER_INO_T_MAX*sizeof(smaller_ino_t)
bytes of history of recycled vnodes (see the sketch below).

> on the maximum number of files for 32-bit architectures. E.g., on
> FreeBSD/ia64 u_int is 64 bits, and thus it would be no problem for its
> API to create and handle more than 4G files/fs. But such a file system
> will be incompatible

Actually, u_int is 32 bits for ia64, and the ino_t API/ABI is independent
of the size of u_int. ino_t is uint32_t.

> with FreeBSD/i386! Isn't this ugly? u_int has nothing to do with
> storage size, while off_t has. It is clear that no media with maximum
> size of

Neither u_int nor off_t has anything to do with the correct storage size
here. off_t is a signed integer type suitable for representing offsets
within files. Since off_t is signed, it is unsuitable for representing
offsets within file systems. It just happens to work because it is 64
bits and an offset of 2^63-1 bytes is enough for anyone ;-). (Actually
it is not even enough for offsets within files, since offsets in
/dev/kmem are often > 2^63 on 64-bit systems.) ino_t is closer to being
the correct type.

The type of v_hash certainly needs to be at least as large as ino_t. My
main point is that although it could be larger so that file systems can
easily create a (unique) id from things like (dirclust, diroffset) pairs,
it is not useful for it to be larger, since file systems need to create
an id for the inode number anyway. (Creation in some file systems, e.g.
ffs, is just copying the inode number from the inode.)

> off_t will contain more than off_t files, while we can't guarantee this
> for u_int, which is bounded by CPU abilities. I think UNIX is about
> compatibility between different architectures, isn't it?

Unix is mostly about source-level compatibility.
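To make the amount of bookkeeping concrete, here is a minimal userland
sketch. Everything in it is invented for illustration (compat_ino(),
struct ino_map and NBUCKETS are not existing kernel interfaces); a real
version would also need locking and a policy for retiring entries.

/*
 * Map a larger internal file id down to a smaller compat ino_t while
 * keeping the handed-out numbers unique and stable.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NBUCKETS 1024

struct ino_map {
	struct ino_map *next;
	uint64_t big_id;	/* e.g. a (dirclust, diroffset) pair */
	uint32_t small_ino;	/* what goes out via struct stat */
};

static struct ino_map *buckets[NBUCKETS];
static uint32_t next_ino = 3;	/* skip 0-2 by convention */

static uint32_t
compat_ino(uint64_t big_id)
{
	struct ino_map **bp, *m;

	/* Return the number this id was already given, if any. */
	bp = &buckets[big_id % NBUCKETS];
	for (m = *bp; m != NULL; m = m->next)
		if (m->big_id == big_id)
			return (m->small_ino);

	/* First sighting: record it so the mapping stays unique. */
	if ((m = malloc(sizeof(*m))) == NULL)
		abort();
	m->big_id = big_id;
	m->small_ino = next_ino++;	/* unique until 2^32 ids are used */
	m->next = *bp;
	*bp = m;
	return (m->small_ino);
}

int
main(void)
{
	uint32_t a, b, c;

	/* Same 64-bit id -> same 32-bit number; new id -> new number. */
	a = compat_ino(1ULL << 40);
	b = compat_ino(1ULL << 41);
	c = compat_ino(1ULL << 40);
	printf("%u %u %u\n", a, b, c);	/* prints "3 4 3" */
	return (0);
}

The table is exactly the history mentioned above: in the worst case it
grows towards 2^32 entries, which is why this doesn't belong in general
code and is better done per-filesystem.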
>> So all current file systems need to generate unique 32-bit inode
>> numbers. This may be difficult, but once it is done I think the inode
>                ^^^^^^^^^^^^^^^^
>
> ...and may be close-to-impossible. What if e.g. Microsoft invents, say,
> FAT-2005 with variable-length directory entries? I'm not sure that for
> every third-party filesystem it would be possible to generate a 32-bit
> pseudo-inode. And it's very bad that we can't handle >4G files/fs at
> all.

It already invented variable-length entries for long names in 1990-1995
:-). But the sizes of the entries are multiples of 32. This is required
for compatibility and won't change.

I think I said that the inode number in msdosfs should be the cluster
number of the first cluster in the file. This would be broken by
variable-sized clusters (unlikely, and even less useful) or by new file
types like symlinks (useful and not so unlikely -- FreeBSD could add
them as an extension).

>> For msdosfs, the inode number is essentially the byte offset divided
>> by the size of a directory entry. The size is 32, so this breaks at a
>> byte offset of 128G instead of 4G. Details:
>
> This is also imperfect: it creates a lot of pain and limitations for
>
>     options MSDOSFS_LARGE

So use the cluster number and only worry about the limit of 16TB for
4K-clusters, etc. The arithmetic is sketched below.
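To put numbers on both schemes, a trivial standalone check (a sketch
only; ino_from_offset(), DE_SIZE and CLUSTER_SIZE are invented names,
and nothing in it is the real msdosfs code -- only the 32-byte entry
size comes from the on-disk format, and 4K is just one common cluster
size):

#include <stdint.h>
#include <stdio.h>

#define DE_SIZE		32	/* on-disk directory entry size */
#define CLUSTER_SIZE	4096	/* one common cluster size */

/* Offset-based scheme: the entry's byte offset scaled down by 32. */
static uint64_t
ino_from_offset(uint64_t entry_byteoff)
{
	return (entry_byteoff / DE_SIZE);
}

int
main(void)
{
	/*
	 * A 32-bit ino_t holds 2^32 values, so each scheme breaks when
	 * its generating quantity reaches 2^32.
	 */
	uint64_t off_limit = ((uint64_t)1 << 32) * DE_SIZE;
	uint64_t clust_limit = ((uint64_t)1 << 32) * CLUSTER_SIZE;

	printf("offset scheme breaks at byte offset %juG\n",
	    (uintmax_t)(off_limit >> 30));		/* 128G */
	printf("cluster scheme breaks at %juTB with 4K clusters\n",
	    (uintmax_t)(clust_limit >> 40));		/* 16TB */
	printf("entry at byte offset 4096 -> pseudo-inode %ju\n",
	    (uintmax_t)ino_from_offset(4096));		/* 128 */
	return (0);
}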
> So, while I understand the complexity of such a transition, it's clear
> that for a long-term solution ino_t should be upgraded to the size of
> off_t everywhere. For a short-term one... Well, msdosfs isn't the
> worst case.

Indeed. The only important cases are ffs and some network file systems
that already support >= 4G files.

Bruce