From owner-freebsd-current  Fri May 30 13:31:06 1997
Return-Path:
Received: (from root@localhost)
	by hub.freebsd.org (8.8.5/8.8.5) id NAA06432
	for current-outgoing; Fri, 30 May 1997 13:31:06 -0700 (PDT)
Received: from godzilla.zeta.org.au (godzilla.zeta.org.au [203.2.228.19])
	by hub.freebsd.org (8.8.5/8.8.5) with ESMTP id NAA06422
	for ; Fri, 30 May 1997 13:31:03 -0700 (PDT)
Received: (from bde@localhost)
	by godzilla.zeta.org.au (8.8.5/8.6.9) id GAA19539;
	Sat, 31 May 1997 06:16:07 +1000
Date: Sat, 31 May 1997 06:16:07 +1000
From: Bruce Evans
Message-Id: <199705302016.GAA19539@godzilla.zeta.org.au>
To: bde@zeta.org.au, dfr@nlsystems.com
Subject: Re: disk cache challenged by small block sizes
Cc: current@freebsd.org
Sender: owner-current@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

>> It seems to fix all the speed problems (except ufs is still slower with
>> a larger fs blocksize) and the block leak in ext2fs.
>
>If you roll vfs_bio.c back to rev 1.115, does it affect the speed of ufs
>with 8k blocksize?  I am not sure whether my changes to vfs_bio would
>affect that.

That wasn't it.  The slowdown was caused by ffs deciding to allocate all
the blocks starting with the first indirect block on a slower part of the
disk.  It attempts to pessimize all cases, but is confused by fuzzy
rounding :-).

Details:

1. The file system has size 96MB (exactly).
2. The defaults for a block size of 4K give 10 cylinder groups (cg's) with
   9 of size 10MB and one smaller one (slightly less than 6MB because of
   special blocks before the first cg).  The average size is about 9.6MB.
3. The defaults for a block size of 8K give 3 cg's with 2 of size 32MB and
   one slightly smaller one.  The average size is about 32MB.
4. I ran iozone on a new file system, so there was just one directory and
   one file.
5. The inode for the file was allocated in cg #0 in both cases.
6. The direct blocks were allocated in the same cg as the inode in both
   cases.
7. The first indirect block and subsequent data blocks are allocated on a
   cg with >= the average number of free blocks.  (The comments before
   ffs_blkpref() about using a rotor are wrong.  fs->fs_cgrotor is never
   used.)
8. In case (2), cg #0 is chosen because it has almost 10MB-metadata free
   and the average is about 9.6MB-metadata.
9. In case (3), cg #0 has significantly less than 32MB-metadata free while
   the average is about 32MB-metadata, so cg #1 is chosen instead.
10. In case (3), cg #1 starts a full 1/3 of the way towards the slowest
    parts of the disk and the speed is significantly slower there.

I think the combination of algorithms behind (6) and (7) is often wrong.
It's silly to put the direct blocks on a different cg than the indirect
blocks immediately following them.  The silliest case is for a new file
system with all cg's of the same size.  Then exact calculation of the
average number of free blocks would result in the indirect blocks always
starting on cg #1 despite cg #0 being almost empty when the first
indirect block is allocated.

I added a bias towards using the same cg as the inode for the first
indirect block.  This is probably too strong.

Bruce

diff -c2 ffs_alloc.c~ ffs_alloc.c
*** ffs_alloc.c~	Mon Mar 24 14:21:27 1997
--- ffs_alloc.c	Sat May 31 03:08:56 1997
***************
*** 689,692 ****
--- 686,700 ----
  		startcg %= fs->fs_ncg;
  		avgbfree = fs->fs_cstotal.cs_nbfree / fs->fs_ncg;
+ 		/*
+ 		 * Prefer the same cg as the inode if this allocation
+ 		 * is for the first block in an indirect block.
+ 		 */
+ 		if (lbn == NDADDR) {
+ 			cg = ino_to_cg(fs, ip->i_number);
+ 			if (fs->fs_cs(fs, cg).cs_nbfree >= avgbfree / 2) {
+ 				fs->fs_cgrotor = cg;
+ 				return (fs->fs_fpg * cg + fs->fs_frag);
+ 			}
+ 		}
  		for (cg = startcg; cg < fs->fs_ncg; cg++)
  			if (fs->fs_cs(fs, cg).cs_nbfree >= avgbfree) {
***************
*** 694,698 ****
  				return (fs->fs_fpg * cg + fs->fs_frag);
  			}
! 		for (cg = 0; cg <= startcg; cg++)
  			if (fs->fs_cs(fs, cg).cs_nbfree >= avgbfree) {
  				fs->fs_cgrotor = cg;
--- 702,706 ----
  				return (fs->fs_fpg * cg + fs->fs_frag);
  			}
! 		for (cg = 0; cg < startcg; cg++)
  			if (fs->fs_cs(fs, cg).cs_nbfree >= avgbfree) {
  				fs->fs_cgrotor = cg;
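
As a rough illustration of the scan described in (7)-(9), here is a small
standalone sketch.  pick_cg() and all of the free-block counts below are
invented for the example; this is not the kernel allocator, just the same
shape of calculation: take the per-cg average of free blocks and pick the
first cg at or above it, like the two loops in ffs_blkpref() above.

/*
 * Standalone sketch only: pick_cg() and the cs_nbfree numbers are
 * invented to mimic cases (2) and (3); they are not real superblock
 * values and this is not the kernel code.
 */
#include <stdio.h>

/* First cg with at least the average number of free blocks. */
static int
pick_cg(long *cs_nbfree, int ncg, int startcg)
{
	long total = 0, avgbfree;
	int cg;

	for (cg = 0; cg < ncg; cg++)
		total += cs_nbfree[cg];
	avgbfree = total / ncg;

	for (cg = startcg; cg < ncg; cg++)
		if (cs_nbfree[cg] >= avgbfree)
			return (cg);
	for (cg = 0; cg < startcg; cg++)
		if (cs_nbfree[cg] >= avgbfree)
			return (cg);
	return (-1);
}

int
main(void)
{
	/*
	 * Like case (2): 10 cg's, the last one smaller; cg #0 stays at
	 * or above the average even after the direct blocks land in it.
	 */
	long case2[10] = { 2540, 2560, 2560, 2560, 2560,
			   2560, 2560, 2560, 2560, 1480 };
	/*
	 * Like case (3): 3 cg's; cg #0 ends up just below the average,
	 * so the scan moves on to cg #1.
	 */
	long case3[3] = { 4060, 4090, 4070 };

	printf("4K-style case: cg #%d\n", pick_cg(case2, 10, 0));
	printf("8K-style case: cg #%d\n", pick_cg(case3, 3, 0));
	return (0);
}

With these made-up numbers it prints cg #0 for the 4K-style case and
cg #1 for the 8K-style case, which is the difference behind (8), (9)
and (10).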