From owner-freebsd-current  Fri May 30 13:31:06 1997
Return-Path:
Received: (from root@localhost)
	by hub.freebsd.org (8.8.5/8.8.5) id NAA06432
	for current-outgoing; Fri, 30 May 1997 13:31:06 -0700 (PDT)
Received: from godzilla.zeta.org.au (godzilla.zeta.org.au [203.2.228.19])
	by hub.freebsd.org (8.8.5/8.8.5) with ESMTP id NAA06422
	for ; Fri, 30 May 1997 13:31:03 -0700 (PDT)
Received: (from bde@localhost)
	by godzilla.zeta.org.au (8.8.5/8.6.9) id GAA19539;
	Sat, 31 May 1997 06:16:07 +1000
Date: Sat, 31 May 1997 06:16:07 +1000
From: Bruce Evans
Message-Id: <199705302016.GAA19539@godzilla.zeta.org.au>
To: bde@zeta.org.au, dfr@nlsystems.com
Subject: Re: disk cache challenged by small block sizes
Cc: current@freebsd.org
Sender: owner-current@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

>> It seems to fix all the speed problems (except ufs is still slower with
>> a larger fs blocksize) and the block leak in ext2fs.
>
>If you roll vfs_bio.c back to rev 1.115, does it affect the speed of ufs
>with 8k blocksize?  I am not sure whether my changes to vfs_bio would
>affect that.

That wasn't it.  The slowdown was caused by ffs deciding to allocate all
the blocks starting with the first indirect block on a slower part of the
disk.  It attempts to pessimize all cases, but is confused by fuzzy
rounding :-).

Details:

1. The file system has size 96MB (exactly).
2. The defaults for a block size of 4K give 10 cylinder groups (cg's) with
   9 of size 10MB and one smaller one (slightly less than 6MB because of
   special blocks before the first cg).  The average size is about 9.6MB.
3. The defaults for a block size of 8K give 3 cg's with 2 of size 32MB and
   one slightly smaller one.  The average size is about 32MB.
4. I ran iozone on a new file system, so there was just one directory and
   one file.
5. The inode for the file was allocated in cg #0 in both cases.
6. The direct blocks were allocated in the same cg as the inode in both
   cases.
7. The first indirect block and subsequent data blocks are allocated on a
   cg with >= the average number of free blocks.  (The comments before
   ffs_blkpref() about using a rotor are wrong.  fs->fs_cgrotor is never
   used.)
8. In case (2), cg #0 is chosen because it has almost 10MB-metadata free
   and the average is about 9.6MB-metadata.
9. In case (3), cg #0 has significantly less than 32MB-metadata free while
   the average is about 32MB-metadata, so cg #1 is chosen instead.
10. In case (3), cg #1 starts a full 1/3 of the way towards the slowest
    parts of the disk and the speed is significantly slower there.

I think the combination of algorithms behind (6) and (7) is often wrong.
It's silly to put the direct blocks on a different cg than the indirect
blocks immediately following them.  The silliest case is for a new file
system with all cg's of the same size.  Then exact calculation of the
average number of free blocks would result in the indirect blocks always
starting on cg #1 despite cg #0 being almost empty when the first
indirect block is allocated.

I added a bias towards using the same cg as the inode for the first
indirect block.  This is probably too strong.

Bruce

diff -c2 ffs_alloc.c~ ffs_alloc.c
*** ffs_alloc.c~	Mon Mar 24 14:21:27 1997
--- ffs_alloc.c	Sat May 31 03:08:56 1997
***************
*** 689,692 ****
--- 686,700 ----
  		startcg %= fs->fs_ncg;
  		avgbfree = fs->fs_cstotal.cs_nbfree / fs->fs_ncg;
+ 		/*
+ 		 * Prefer the same cg as the inode if this allocation
+ 		 * is for the first block in an indirect block.
+ 		 */
+ 		if (lbn == NDADDR) {
+ 			cg = ino_to_cg(fs, ip->i_number);
+ 			if (fs->fs_cs(fs, cg).cs_nbfree >= avgbfree / 2) {
+ 				fs->fs_cgrotor = cg;
+ 				return (fs->fs_fpg * cg + fs->fs_frag);
+ 			}
+ 		}
  		for (cg = startcg; cg < fs->fs_ncg; cg++)
  			if (fs->fs_cs(fs, cg).cs_nbfree >= avgbfree) {
***************
*** 694,698 ****
  				return (fs->fs_fpg * cg + fs->fs_frag);
  			}
! 		for (cg = 0; cg <= startcg; cg++)
  			if (fs->fs_cs(fs, cg).cs_nbfree >= avgbfree) {
  				fs->fs_cgrotor = cg;
--- 702,706 ----
  				return (fs->fs_fpg * cg + fs->fs_frag);
  			}
! 		for (cg = 0; cg < startcg; cg++)
  			if (fs->fs_cs(fs, cg).cs_nbfree >= avgbfree) {
  				fs->fs_cgrotor = cg;
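
As a rough illustration of the scan described in (7)-(9), here is a small
standalone sketch.  pick_cg() and all of the free-block counts below are
invented for the example; this is not the kernel allocator, just the same
shape of calculation: take the per-cg average of free blocks and pick the
first cg at or above it, like the two loops in ffs_blkpref() above.

/*
 * Standalone sketch only: pick_cg() and the cs_nbfree numbers are
 * invented to mimic cases (2) and (3); they are not real superblock
 * values and this is not the kernel code.
 */
#include <stdio.h>

/* First cg with at least the average number of free blocks. */
static int
pick_cg(long *cs_nbfree, int ncg, int startcg)
{
	long total = 0, avgbfree;
	int cg;

	for (cg = 0; cg < ncg; cg++)
		total += cs_nbfree[cg];
	avgbfree = total / ncg;

	for (cg = startcg; cg < ncg; cg++)
		if (cs_nbfree[cg] >= avgbfree)
			return (cg);
	for (cg = 0; cg < startcg; cg++)
		if (cs_nbfree[cg] >= avgbfree)
			return (cg);
	return (-1);
}

int
main(void)
{
	/*
	 * Like case (2): 10 cg's, the last one smaller; cg #0 stays at
	 * or above the average even after the direct blocks land in it.
	 */
	long case2[10] = { 2540, 2560, 2560, 2560, 2560,
			   2560, 2560, 2560, 2560, 1480 };
	/*
	 * Like case (3): 3 cg's; cg #0 ends up just below the average,
	 * so the scan moves on to cg #1.
	 */
	long case3[3] = { 4060, 4090, 4070 };

	printf("4K-style case: cg #%d\n", pick_cg(case2, 10, 0));
	printf("8K-style case: cg #%d\n", pick_cg(case3, 3, 0));
	return (0);
}

With these made-up numbers it prints cg #0 for the 4K-style case and
cg #1 for the 8K-style case, which is the difference behind (8), (9)
and (10).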