Date: Fri, 24 Jun 2011 07:06:59 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Kostik Belousov <kostikbel@gmail.com>
Cc: freebsd-fs@FreeBSD.org, Garance A Drosehn <gad@FreeBSD.org>
Subject: Re: [rfc] 64-bit inode numbers
Message-ID: <20110624054322.V1086@besplex.bde.org>
In-Reply-To: <20110623081140.GQ48734@deviant.kiev.zoral.com.ua>
References: <20101201091203.GA3933@tops> <20110104175558.GR3140@deviant.kiev.zoral.com.ua> <20110120124108.GA32866@tops.skynet.lt> <4E027897.8080700@FreeBSD.org> <20110623064333.GA2823@tops> <20110623081140.GQ48734@deviant.kiev.zoral.com.ua>
On Thu, 23 Jun 2011, Kostik Belousov wrote:

> On Thu, Jun 23, 2011 at 09:43:33AM +0300, Gleb Kurtsou wrote:
>> On (22/06/2011 19:19), Garance A Drosehn wrote:
>>> On 1/20/11 7:41 AM, Gleb Kurtsou wrote:
>>>> I've updated the patch. New version is available here:
>>>> https://github.com/downloads/glk/freebsd-ino64/freebsd-ino64-patch-2011-01-20.tgz
>>>>
>>>> Changelog:
>>>> * Add fts, ftw, nftw compat shims in libc
>>>> * Place libc compat shims in separate files, don't hack original
>>>>   implementations.
>>>> * Fix dump/restore
>>>> * Use ino_t in UFS code (suggested by Kirk McKusick)

Of course it must not use ino_t in the parts of ffs related to the
on-disk inode. Your patch does this, but I wonder if it converts from
the disk inode to ino_t too early in some places. C's type system is
too weak to find wrong conversions easily. On an old system, I once
used funky types like double or a pointer for at least mode_t to find
all the places that assumed mode_t to be an int. This helped find all
the places that assumed it to be an int of a particular size.

>>>> * Keep ufs_ino_t (32 bit) for boot2 not to increase size
>>>>
>>> Sorry for replying to an older message, but a reply made in a different
>>> thread reminded me about this project...
>>>
>>> Also, I may have asked this before. In fact, I'm almost sure that I started
>>> a reply to this back in Jan/Feb, but my email client claims I never replied
>>> to this topic...
>>>
>>> Are you increasing only the size of ino_t, or could you also look at
>>> increasing the size of dev_t? (just curious...)
>>
>> Sure. Incorporating as much of similar changes as possible is good.

Increasing the size of dev_t would be negatively good. Even when the
minor number was meaningful and was abused to encode device control
sparsely, 4 billion devices is thousands of times as many as needed.
Without the sparse mapping, it is millions of times as many as needed.
Reducing it back to 16 bits like it was in FreeBSD-1 would be good, but
would break portability. Finding all the places that assume that it is
32 bits and changing them to uint32_t would be good.

ffs is already partly correct here (unlike for ino_t). Its di_rdev is
di_db[0], and di_db is either ufs1_daddr_t (int32_t) or ufs2_daddr_t
(int64_t). Thus the on-disk type is already independent of dev_t. But
this is only the start of being correct. ffs does blind assignments to
and from va_rdev to dev_t's, and suffers overflows if the types are
different. I hope the new ino_t code doesn't do blind assignments.

Since opening of device nodes on ffs file systems is no longer
supported, the device numbers in di_rdev are only used for
compatibility:
- mknod() still works to create specified device numbers, provided they
  fit in a 32-bit dev_t (strictly, 32-bit ones don't fit since
  ufs1_daddr_t only has 31 value bits, but the overflow for blind
  assignment of the 32nd value bit is benign on all supported arches).
  So you can still back up your FreeBSD-4 /dev or maybe your Linux /dev
  on an ffs file system.
- mknod() is still abused by badsect(8) to encode bad sector numbers in
  di_rdev. This may even still work for ffs1. It is broken for ffs2 by
  the type mismatch, and the blind assignments result in the error not
  being detected (ffs2 has 64-bit sector numbers, and its di_rdev can
  encode these, but mknod() can only pass 32-bit device numbers).
  FreeBSD-1 had the same problem with 16-bit device numbers not being
  able to encode 32-bit sector numbers. I hoped I fixed badsect(8)
  enough to detect all cases where the blind assignment will fail.

>> I've added Kostik and Matthew to CC list, it's for them to decide.
>>
>> dev_t on other OSes:
>> NetBSD - uint64_t
>> DragonFly - uint32_t
>> Darwin - __int32_t
>> OpenSolaris - ulong_t
>> Linux - __u32
>>
>> Considering this I think 3rd party software is not ready for such
>> change.

Well, it should be ready, since the size depends on the O/S.
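
To illustrate the kind of range check that a blind assignment omits, here is a minimal sketch in C. The helper name rdev_fits() is hypothetical and not taken from the FreeBSD tree, and the use of EOVERFLOW as the error is an assumption for illustration. The caller passes the largest value the destination field can hold, so the same check covers a 32-bit dev_t (UINT32_MAX) and the 31 value bits of an int32_t field such as ufs1_daddr_t (INT32_MAX).

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper (an assumption, not FreeBSD code): report whether
 * a device number held in a wide type can be stored in a narrower field
 * without losing bits.  field_max is the largest value the destination
 * can represent.  Returns 0 if the value fits, EOVERFLOW otherwise.
 */
static int
rdev_fits(uint64_t rdev, uint64_t field_max)
{
	if (rdev > field_max)
		return (EOVERFLOW);
	return (0);
}
```

A caller would test rdev_fits() before the assignment and return the error instead of storing a truncated value. For example, 0x80000000 fits a 32-bit unsigned field but not a field with only 31 value bits, and 0x100000000 fits neither.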
Suppose a NetBSD system actually uses 64-bit device numbers. FreeBSD
cannot support this now, so it should give an error for an attempt to
back up a NetBSD /dev, but the blind assignments may break this.
ulong_t on Solaris might give the same problem on 64-bit machines, but
I guess ulong_t is actually an obfuscation of uint32_t.

>> Major/minor mapping to dev_t will get more complicated.
>>
>> And the most important question: what would you want it for? As far as I
> Indeed, this is the right question.
>
>> can see major/minor numbers are ignored nowadays, major is zero, minor
>> increases independently of device type:
> This is only because you have too little /dev nodes.

How can he have >= 4G /dev nodes to test this? :-) Ah, I think I see:
for devfs, the major number is normally 0, and minor numbers don't
encode anything and are allocated sequentially and may differ across
boots. But there are only 24 minor number bits according to
major/minor, so the major must change from 0 to 1 on the 2**24 ~= 16
millionth device or earlier (I think actually on the 2**8th = 256th
device, due to the encoding of major/minor being for compatibility
with 16-bit dev_t).

> Look at the definitions of the major/minor in sys/types.h.

These are only for compatibility. Even expanding dev_t would break
this compatibility. The types of breakage are easier to see for
reducing dev_t back to 16 bits. Then for devfs, the major number
should change from 0 to 1 on the 256th device, but nothing should
break until the 65536th device; the major/minor split that is still
displayed by ls(1) is meaningless.
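
The point about the 256th device can be checked with a small sketch in C. It assumes the compatibility layout described above, with the major number in bits 8..15 and the minor number in the remaining 24 bits; the function names legacy_major() and legacy_minor() are hypothetical stand-ins, not the actual macros from sys/types.h.

```c
#include <stdint.h>

/*
 * Assumed legacy layout (an assumption based on the description above,
 * not copied from sys/types.h): major in bits 8..15, minor in the low
 * 8 bits plus the upper 16 bits, giving 24 minor bits in total.
 */
static unsigned
legacy_major(uint32_t dev)
{
	return ((dev >> 8) & 0xff);
}

static uint32_t
legacy_minor(uint32_t dev)
{
	return (dev & 0xffff00ffu);
}
```

Under this assumed layout, a sequentially allocated device number of 255 decodes as major 0, minor 255, while 256 decodes as major 1, minor 0. The number 65536 decodes as major 0 again with minor 65536, which is one way to see that the displayed split carries no meaning for sequentially allocated numbers.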
For non-devfs, things like backing up OtherOS's /dev or even your own
/dev to an ffs file system will break on the 65536th device; anything
depending on the encoding of minor numbers or the major/minor split
will break on the 256th minor, but I can't see how anything in FreeBSD
can reasonably depend (dynamically) on this encoding or split -- the
device number is just an index for an actual device, and you can't do
anything with it in a device node except copy the node.

Bruce