Date: Tue, 1 Nov 2005 15:31:33 +1100 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: Ivan Voras <ivoras@fer.hr>
Cc: freebsd-fs@freebsd.org
Subject: Re: ext2 large_file
Message-ID: <20051101141726.W41623@delplex.bde.org>
In-Reply-To: <20051031201719.S68800@geri.cc.fer.hr>
References: <20051030183340.B19470@geri.cc.fer.hr> <46D894BD-16E0-4CBA-B40A-EEBAAC2547D2@classicalguitar.net> <20051031191139.J38757@delplex.bde.org> <20051031160354.G67271@geri.cc.fer.hr> <20051101042444.K40281@delplex.bde.org> <20051031201719.S68800@geri.cc.fer.hr>
On Mon, 31 Oct 2005, Ivan Voras wrote:

> On Tue, 1 Nov 2005, Bruce Evans wrote:
>
>> Unless the file system already has or had a large file.  Possible
>> workarounds:
>> (1) Boot Linux and create a large file.  Hopefully e2fsck only sets the
>> flag so you only have to do this once.
>
> I did this but e2fsck doesn't set the flag.  Fortunately, I found out
> that e2fsprogs includes a "debugfs" utility with which I manually set
> the flag.
>
> It works now!

Does e2fsck report the problem?

> ext2 filesystem access is still a bit slower than with Windows XP with
> the ext2+ext3 IFS driver (~20.5MB/s vs ~25MB/s).  The reason I brought
> up this subject is that I'm experimenting with using ext2 instead of
> msdosfs for exchanging data between the systems in a dual-boot
> configuration.  Because ext2 large_file support works now, I think it's
> much safer and even somewhat faster (less fragmentation!  FreeBSD's
> msdosfs looks like it's pessimized for fragmentation!) to use instead.

Strangely enough, I first got interested in ext2fs under FreeBSD because
testing showed that it was faster than ffs in one configuration, and this
turned out to be mostly because of fragmentation:

- ext2fs under FreeBSD has a primitive block allocator that will give
  lots of fragmentation over the long term but is almost optimal in
  simple tests.  It doesn't really understand cylinder groups and just
  allocates the next free block, so in simple tests that create files in
  one process and never delete files, the layout is almost optimal.  In
  particular, the layout is good after copying a large directory tree to
  a new file system.  You can see evidence of this using dumpe2fs -- it
  shows the first few cylinder groups full and the rest unused, where
  Linux would use all the groups fairly evenly.  (A sketch of this
  next-free-block policy follows this list.)

- ffs at the time had a not very good block allocator that optimized for
  fragmentation of directories (optimized for this == pessimized for
  performance), so it gave very poor performance for large directory
  trees with small files.  My test was with the Linux src tree.  The
  FreeBSD ports tree would be pessimized more.  This has been fixed.
  Now the problems in ffs's block allocator are more local.

- my test drive at the time (1997?) didn't have much caching, and this
  interacted badly with ffs's block allocator.  Even for sequentially
  created files, ffs likes to seek backwards to fill in fragments with
  small files, and the drive's cache size or caching algorithm apparently
  didn't like these backwards seeks although they are small.  ffs still
  does this, but drives' caches are now large enough for another physical
  access to usually not be needed to get back to the small files.  ffs's
  other known remaining allocation problems involve not allocating
  indirect blocks sequentially; this problem, or something related, is
  especially large for soft updates -- soft updates takes advantage of
  its delayed block allocation to put indirect blocks further away.  This
  used to cause a 10% performance penalty for a freshly laid out copy of
  /usr/src, but now with bigger drives and caches it is less noticeable.
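For illustration, here is a minimal sketch in C of the next-free-block
("rotor") policy described above.  This is not the actual ext2fs code;
the blockmap structure and function names are invented for the example:

	#include <stddef.h>
	#include <stdint.h>

	/*
	 * Hypothetical next-free-block allocator: remember where the
	 * last allocation ended and scan forward from there, wrapping
	 * once.  Files created sequentially by one process come out
	 * almost contiguous; long-term use fragments badly because
	 * blocks freed behind the rotor are reused only after a wrap.
	 */
	struct blockmap {
		uint8_t	*bits;		/* one bit per block; 1 == in use */
		size_t	 nblocks;	/* total blocks in the file system */
		size_t	 rotor;		/* next place to look for a free block */
	};

	static int
	bit_isset(const struct blockmap *bm, size_t b)
	{
		return ((bm->bits[b / 8] >> (b % 8)) & 1);
	}

	static void
	bit_set(struct blockmap *bm, size_t b)
	{
		bm->bits[b / 8] |= 1 << (b % 8);
	}

	/* Return the allocated block, or (size_t)-1 if the map is full. */
	size_t
	alloc_next_free(struct blockmap *bm)
	{
		size_t b, i;

		for (i = 0; i < bm->nblocks; i++) {
			b = (bm->rotor + i) % bm->nblocks;
			if (!bit_isset(bm, b)) {
				bit_set(bm, b);
				bm->rotor = b + 1;  /* keep going forward */
				return (b);
			}
		}
		return ((size_t)-1);
	}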
I use the following to break the optimization for fragmentation in
msdosfs:

% Index: msdosfs_fat.c
% ===================================================================
% RCS file: /home/ncvs/src/sys/fs/msdosfs/msdosfs_fat.c,v
% retrieving revision 1.35
% diff -u -2 -r1.35 msdosfs_fat.c
% --- msdosfs_fat.c	29 Dec 2003 11:59:05 -0000	1.35
% +++ msdosfs_fat.c	26 Apr 2004 05:03:55 -0000
% @@ -68,4 +68,6 @@
%  #include <fs/msdosfs/fat.h>
%
% +static int fat_allocpolicy = 1;
% +
%  /*
%   * Fat cache stats.
% @@ -759,4 +761,14 @@
%  	if (got)
%  		*got = count;
% +
% +	/*
% +	 * For internal use, cluster pmp->pm_nxtfree is not necessarily free
% +	 * but is the next place to look for a free cluster.  Perhaps this
% +	 * is the correct thing to pass to the next mount too.
% +	 */
% +	pmp->pm_nxtfree = start + count;
% +	if (pmp->pm_nxtfree > pmp->pm_maxcluster)
% +		pmp->pm_nxtfree = CLUST_FIRST;
% +
%  	return (0);
%  }
% @@ -796,9 +808,30 @@
%  	len = 0;
%
% -	/*
% -	 * Start at a (pseudo) random place to maximize cluster runs
% -	 * under multiple writers.
% -	 */
% -	newst = random() % (pmp->pm_maxcluster + 1);
% +	switch (fat_allocpolicy) {
% +	case 0:
% +		newst = start;
% +		break;
% +	case 1:
% +		newst = pmp->pm_nxtfree;
% +		break;
% +	case 5:
% +		newst = (start == 0 ? pmp->pm_nxtfree : start);
% +		break;
% +	case 2:
% +		/* FALLTHROUGH */
% +	case 3:
% +		if (start != 0) {
% +			newst = fat_allocpolicy == 2 ? start : pmp->pm_nxtfree;
% +			break;
% +		}
% +		/* FALLTHROUGH */
% +	default:
% +		/*
% +		 * Start at a (pseudo) random place to maximize cluster runs
% +		 * under multiple writers.
% +		 */
% +		newst = random() % (pmp->pm_maxcluster + 1);
% +	}
% +
%  	foundl = 0;

Only fat_allocpolicy == 1 and its case in the switch statement are needed
here.  The other cases are for testing how bad alternative simple
allocators are.  Policy 1 gives the same primitive sequential allocation
as in Linux -- this works well for copying but not so well when there is
lots of file system activity from multiple concurrent processes.
According to postmark, it is still much better than random allocation
with multiple processes (but more like 2 to 4 times better than 10
times).  The fix for advancing pmp->pm_nxtfree might not be needed.
IIRC, it is mostly part of a fix for passing pm_nxtfree across mounts.
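The patch leaves fat_allocpolicy as a compile-time static.  To switch
policies at runtime while testing, the usual FreeBSD approach would be a
sysctl knob.  A hedged sketch -- this is not part of the patch above, and
the vfs.msdosfs node and OID name are assumptions, not stock code:

	#include <sys/param.h>
	#include <sys/kernel.h>
	#include <sys/sysctl.h>

	static int fat_allocpolicy = 1;

	/*
	 * Hypothetical knob: expose the allocation policy so the switch
	 * statement in the patch above can be exercised without
	 * recompiling the kernel.
	 */
	SYSCTL_NODE(_vfs, OID_AUTO, msdosfs, CTLFLAG_RW, 0,
	    "msdosfs options");
	SYSCTL_INT(_vfs_msdosfs, OID_AUTO, fat_allocpolicy, CTLFLAG_RW,
	    &fat_allocpolicy, 0, "FAT cluster allocation policy (0-5)");

With something like this in place, "sysctl vfs.msdosfs.fat_allocpolicy=3"
would switch policies between postmark runs.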
With these and some other optimizations for msdosfs, and optimizations
and pessimizations for ext2fs, I get access times for a fresh copy of
75% of /usr/src (all that will fit in VMIO on a system with 1GB --
source always fully cached):

% bde-current writing to IBM IC35L060AVV207-0 h: 24483060 57512700
% tar = tar
% srcs = "contrib crypto lib sys" in /i/src
% ---
%
% ffs-16384-02048-1:
%     tarcp /f srcs:          50.93 real   0.22 user   6.68 sys
%     tar cf /dev/zero srcs:  13.63 real   0.17 user   2.35 sys
% ffs-16384-02048-2:
%     tarcp /f srcs:          45.15 real   0.27 user   6.71 sys
%     tar cf /dev/zero srcs:  14.99 real   0.20 user   2.33 sys
% ffs-16384-02048-as-1:
%     tarcp /f srcs:          21.91 real   0.38 user   4.54 sys
%     tar cf /dev/zero srcs:  13.82 real   0.21 user   2.30 sys
% ffs-16384-02048-as-2:
%     tarcp /f srcs:          21.08 real   0.34 user   4.64 sys
%     tar cf /dev/zero srcs:  15.24 real   0.15 user   2.41 sys
% ffs-16384-02048-su-1:
%     tarcp /f srcs:          42.25 real   0.37 user   4.87 sys
%     tar cf /dev/zero srcs:  14.13 real   0.15 user   2.37 sys
% ffs-16384-02048-su-2:
%     tarcp /f srcs:          47.76 real   0.34 user   4.93 sys
%     tar cf /dev/zero srcs:  16.25 real   0.16 user   2.38 sys
%
% ext2fs-1024-1024:
%     tarcp /f srcs:         108.68 real   0.30 user   7.99 sys
%     tar cf /dev/zero srcs:  41.15 real   0.17 user   5.63 sys
% ext2fs-1024-1024-as:
%     tarcp /f srcs:          81.10 real   0.29 user   7.03 sys
%     tar cf /dev/zero srcs:  41.57 real   0.19 user   5.62 sys
% ext2fs-4096-4096:
%     tarcp /f srcs:         107.48 real   0.32 user   6.75 sys
%     tar cf /dev/zero srcs:  27.37 real   0.08 user   3.00 sys
% ext2fs-4096-4096-as:
%     tarcp /f srcs:          61.87 real   0.34 user   5.72 sys
%     tar cf /dev/zero srcs:  27.33 real   0.16 user   2.93 sys
%
% msdosfs-4096-4096:
%     tarcp /f srcs:          41.53 real   0.48 user   8.69 sys
%     tar cf /dev/zero srcs:  16.94 real   0.18 user   4.40 sys

Here the first two numbers attached to the fs name are the block and
fragment sizes; "as" means async mount and "su" means soft updates; the
final number for ffs is for ffs1 vs ffs2.  This shows the following
points:

- soft updates (in this test) is now not much faster than ordinary
  (-nosync -noasync) mounts and is much slower than async mounts.  It
  used to be only 1.5 times slower than async mounts.  This test was run
  when bufdaemon was buggier than it is now, and it showed bufdaemon
  behaving badly under pressure, with only soft updates creating enough
  pressure to cause problems.

- soft updates is still about 5% slower for readback.  My kernel has
  changes to allocate indirect blocks sequentially, but only for ffs1,
  and I'm not sure if the fixes work for soft updates.

- msdosfs is competitive with non-async ffs provided it uses clustering
  and VMIO as in my version.  However, it cheats to get this -- its most
  important metadata, namely the FAT, is updated using delayed writes
  unless you mount with -sync.  -sync is thus needed to get near the
  same robustness as the default for ffs.

- ext2fs is about twice as slow as the other two (worse for non-async
  writes).  For async writes, this is partly because -async is not fully
  implemented.  It is mostly because the block size is very small, and
  although this only necessarily costs extra CPU to do clustering,
  FreeBSD is optimized for ffs's default block size and does pessimal
  things with ext2fs's smaller sizes.

- Both msdosfs and ffs are about as efficient as can be hoped for
  read-back: 13 to 16 seconds for reading 340MB of small files is
  20-25MB/sec.  This is on a drive with a max transfer rate of 45 or
  55 MB/sec and not very fast (normal ATA 7200 rpm) seeks.  On active
  (fragmented) file systems you have to be lucky to get half of that.
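As a rough illustration of the read-back measurement, a minimal
user-level timing harness might look like the sketch below.  This is
hypothetical -- the numbers above come from tar and tarcp, not from this
program, and the 64KB buffer size is an arbitrary choice:

	#include <sys/time.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	/*
	 * Hypothetical read-throughput check: read one file
	 * sequentially and report MB/s, roughly what
	 * "tar cf /dev/zero" measures across a whole tree.
	 */
	int
	main(int argc, char **argv)
	{
		static char buf[65536];
		struct timeval t0, t1;
		double secs, total = 0;
		ssize_t n;
		int fd;

		if (argc != 2) {
			fprintf(stderr, "usage: %s file\n", argv[0]);
			return (1);
		}
		if ((fd = open(argv[1], O_RDONLY)) == -1) {
			perror("open");
			return (1);
		}
		gettimeofday(&t0, NULL);
		while ((n = read(fd, buf, sizeof(buf))) > 0)
			total += n;
		gettimeofday(&t1, NULL);
		close(fd);
		secs = (t1.tv_sec - t0.tv_sec) +
		    (t1.tv_usec - t0.tv_usec) / 1e6;
		printf("%.1f MB in %.2f s = %.1f MB/s\n", total / 1048576,
		    secs, total / 1048576 / secs);
		return (0);
	}

Note that a harness like this measures the buffer cache unless the data
set is larger than RAM or the file system is remounted first.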
On my active /usr, reading the same files takes 49 seconds.  This is on
a drive with a max transfer rate of 36MB/sec.

> I propose this patch to the mount_ext2fs manual page:

Someone else will have to look after this.  You might have to file a PR
so that it doesn't get lost.

Bruce