From owner-freebsd-fs@FreeBSD.ORG Sat Mar 28 02:45:07 2015
Date: Sat, 28 Mar 2015 13:44:57 +1100 (EST)
From: Bruce Evans
X-X-Sender: bde@besplex.bde.org
To: Alexander Motin
Cc: freebsd-fs@freebsd.org, "freebsd-hackers@freebsd.org"
Subject: Re: MAXBSIZE increase
In-Reply-To: <5515C421.4040703@FreeBSD.org>
Message-ID: <20150328111733.L963@besplex.bde.org>
References: <5515C421.4040703@FreeBSD.org>

> Experimenting with NFS and ZFS I found an inter-operation issue: ZFS
> by default uses blocks of 128KB, while FreeBSD NFS (both client and
> server) is limited to 64KB requests by the value of MAXBSIZE.  On file
> rewrite that limitation makes ZFS do slow read-modify-write cycles for
> every write operation, instead of just writing the new data.  A
> trivial iozone test shows a major difference between initial write and
> rewrite speeds because of this issue.
>
> Looking through the sources I found, and in r280347 fixed, a number of
> improper MAXBSIZE use cases in device drivers.  After that I see no
> reason why MAXBSIZE cannot be increased to at least 128KB to match the
> ZFS default (ZFS now supports blocks of up to 1MB, but that is not the
> default and is so far rare).  I've made a test build and also
> successfully created a UFS file system with a 128KB block size -- not
> sure that is needed, but it seems to survive this change well too.
>
> Is there anything I am missing, or is it safe to raise this limit now?

I see the following minor problems:

- static and dynamic allocation of MAXBSIZE bytes would be more wasteful
than before.

- boot blocks used to do static allocation of MAXBSIZE bytes.  Now they
just do ffs's sanity check that the block size is less than that.  A
block size larger than this is not necessarily invalid, but just
unsupported by the running instance of the buffer cache layer (so
unsupported by the running instance of ffs too).  Another or the same OS
may have a larger MAXBSIZE, and the user may have broken portability by
actually using this to create a file system that cannot be read by OS's
with the historical MAXBSIZE.  This check is bogus for boot blocks,
since they don't use the buffer cache layer.
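The check I mean is roughly of this shape (a sketch from memory, not a
quote of the boot2 or ffs sources; MAXBSIZE is from <sys/param.h> and
the function name is made up):

X 	/*
X 	 * An fs_bsize above MAXBSIZE is merely unsupported by this
X 	 * instance of the buffer cache layer, not invalid on disk.
X 	 */
X 	static int
X 	bsize_supported(long fs_bsize)
X 	{
X 
X 		return (fs_bsize <= MAXBSIZE);
X 	}

A boot loader or another OS built with a larger MAXBSIZE would happily
create or read block sizes that fail this check on the running kernel,
which is the portability trap described above.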
ufsread.c uses a sort of anti-buffer-cache to avoid these problems, but
gives extreme slowness: it uses a virtual block size of 4K and does i/o
4K at a time with no caching.  The buffer must not cross a 64K boundary
on x86, and the MI code states this requirement for all arches.  In
i386/boot2, dmadat is 64K-aligned, so the virtual buffer size could be
up to 64K, except that dmadat is used for 3 other buffers and only 4K is
used for the data buffer.  The data structure for this is:

X /* Buffers that must not span a 64k boundary. */
X struct dmadat {
X 	char blkbuf[VBLKSIZE];	/* filesystem blocks */
X 	char indbuf[VBLKSIZE];	/* indir blocks */
X 	char sbbuf[SBLOCKSIZE];	/* superblock */
X 	char secbuf[DEV_BSIZE];	/* for MBR/disklabel */
X };
X static struct dmadat *dmadat;

I don't like the FreeBSD boot code, and use my version of biosboot if
possible.  I expanded its buffers and improved its caching a year or 2
ago.  Old versions have 2 buffers of size MAXBSIZE in commented-out
code, since this doesn't work, especially when written in C.  The
commented-out code also sets a size of 4K for one of these buffers.
This last worked, for the default ffs block size only, in about 1990
(this code is from Mach).  The old code actually uses 3 buffers of size
8K, corresponding to 3 of the 4 buffers in dmadat.  This broke about 15
years ago when the default and normal ffs block size was increased to
16K.  I fixed this by allocating all of the buffers in asm.  From
start.S:

X ENTRY(disklabel)
X 	. = EXT(boot1) + 0x200 + 0*276 + 1*0x200
X
X 	.globl EXT(buf)
X 	.set EXT(buf), EXT(boot1) + 0x20000
X 	.globl EXT(iobuf)
X 	.set EXT(iobuf), EXT(boot1) + 0x30000
X 	.globl EXT(mapbuf)
X 	.set EXT(mapbuf), EXT(boot1) + 0x40000

boot1 is loaded at a 64K boundary and overlaid with boot2, the same as
in the -current boot2.  The above bites off 64K pieces of the heap for
all large data structures.  -current's boot2 bites off only (64K?) for
dmadat instead, using hackish C code.

Then I improved the caching.  biosboot was using my old code, which did
caching mainly for floppies, since old systems were too slow to keep up
with reading even floppies one 512-block at a time.  It used a
read-ahead buffer of size 18*512 = 9K to optimize for floppies up to
size 1440K.  This worked OK for old hard disks with the old default ffs
block size of 8K too, but it gives much the same anti-caching as
-current's virtual 4K buffers when the ffs block size is 16K or larger.
I didn't expand the cache to a large one on the heap, but just changed
it to 32*512 = 16K to work well with my default ffs block size of 16K
(32K is pessimal for my old disk), and fixed some alignment problems
(the old code attempts to align on track boundaries, but tracks don't
exist for modern hard disks, and the alignment needs to be to ffs block
boundaries, else 16K-blocks would be split every time in the 16K
"cache").

Summary for the boot blocks: they seem to be unaffected by increasing
MAXBSIZE, but their anti-cache works even better for fragmenting larger
blocks.

- the buffer cache is still optimized for i386 with BKVASIZE = 8K.
64-bit systems don't need the complications and pessimizations to fit in
i386's limited kva, but have them anyway.  When the default ffs block
size was doubled to 16K, BKVASIZE was doubled to match, but the tuning
wasn't doubled to match.  This reduced the effective number of buffers
by a factor of 2.  This pessimization was mostly unnoticeable, since
memory sizes grew by more than a factor of 2 and nbuf grew by about a
factor of 2.  But increasing (nbuf*BKVASIZE) much more isn't possible on
i386, since it reaches a kva limit.
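To spell out the arithmetic behind "reduced the effective number of
buffers by a factor of 2" (toy numbers only: the 200MB buffer-map kva
budget below is made up and is not the real i386 tuning):

X 	#include <stdio.h>
X 
X 	/*
X 	 * Toy calculation, not the kernel's tuning code: for a fixed
X 	 * buffer-map kva budget, the number of buffers that fit scales
X 	 * inversely with BKVASIZE.
X 	 */
X 	int
X 	main(void)
X 	{
X 		const long bufkva = 200L * 1024 * 1024;	/* made-up budget */
X 		long bkvasize;
X 
X 		for (bkvasize = 8192; bkvasize <= 65536; bkvasize *= 2)
X 			printf("BKVASIZE %6ld -> about %ld buffers\n",
X 			    bkvasize, bufkva / bkvasize);
X 		return (0);
X 	}

Doubling BKVASIZE without also doubling the kva given to the buffer map
halves the effective nbuf; keeping BKVASIZE small while the fs block
size grows trades that for buffer kva fragmentation instead.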
Then when ffs's default block size was doubled to 32K, BKVASIZE wasn't
even doubled to match.  If anyone actually uses the doubled MAXBSIZE,
then BKVASIZE will be mistuned by another factor of 2.

They probably shouldn't do that.  A block size of 64K already works
poorly in ffs.  Relative to a block size of 32K, it mainly doubles the
size for metadata i/o without making much difference for data i/o,
since data i/o is clustered.  An fs block size equal to MAXPHYS also
makes clustering useless, by limiting the maximum number of blocks per
cluster to 1.  That is better than the ratio of 4/32 in ufsread and
9/{8,16,32} in old biosboot, but still silly.  Large fs block sizes
(where "large" is about 2K) are only good when clustering doesn't work
and the disk doesn't like small blocks.  This may be the case for ffs
on newer hard disks.  Metadata is not clustered for ffs.  My old hard
disks like any size larger than 16K, but my not so old hard disks
prefer 32K or above.

nfs for zfs will actually use the new MAXBSIZE.  I don't like it using
a hard-coded size.  It gives buffer cache fragmentation.  The new
MAXBSIZE will non-accidentally match the fs block size for zfs, but
even the old MAXBSIZE doesn't match the usual fs block size for any
file system.

- cd9660 mount uses MAXBSIZE for a sanity check.  It can only support
block sizes up to that, but there must be an fs limit too.  It should
probably use min(CD9660_MAXBSIZE, MAXBSIZE) (a sketch of what I mean is
appended at the end of this mail).

- similarly in msdosfs, except I'm sure that there is an fs limit of
64K.  Microsoft specifies that the limit is 32K, but 64K works in
FreeBSD and perhaps even in Microsoft OS's.

- similarly in ffs, except the ffs limit is historically identical to
MAXBSIZE.  I think it goes the other way -- MAXBSIZE = 64K is the
historical ffs limit, and the buffer cache has to support that.
Perhaps ffs should remain at its historical limit.  The lower limit is
still local in ffs.  It is named MINBSIZE.  Its value is 4K in -current
but 512 in my version.  ffs has no fundamental limit at either 4K or
64K, and can support any size supported by the hardware after fixing
some bugs involving assumptions that the superblock fits in an ffs
block.

- many file systems use MAXBSIZE to limit the read-ahead for
cluster_read().  This seems wrong.  cluster_read() has a natural limit
of geom's virtual "best" i/o size (normally MAXPHYS).  The decision
about the amount of read-ahead should be left to the clustering code if
possible, but it is unclear what this should be.  The clustering code
gets this wrong anyway.  It has sysctls vfs.read_max (default 64) and
vfs.write_behind (default 1).  The units for these are broken: they are
fs-blocks.  A read-ahead of 64 fs-blocks of size 512 is too different
from a read-ahead of 64 fs-blocks of size MAXBSIZE, whatever the latter
is.  My version uses a read-ahead scaled in 512-blocks (default 256
blocks = MAXPHYS bytes).  The default read-ahead shouldn't vary much
with MAXPHYS, MAXBSIZE or the fs block size, but should vary with the
device (don't read ahead 64 large fs blocks on a floppy disk device, as
asked for by -current's read_max ...).

- ffs utilities like fsck are broken by limiting themselves to the
buffer cache limit of MAXBSIZE, like the boot blocks but with less
reason, since they don't have space constraints and not being limited
by the current OS is more useful.  Unless MAXBSIZE = 64K is considered
to be a private ffs limit that escaped.  Then the ffs code should spell
it FFS_MAXBSIZE or 64K.

Bruce
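To be concrete about the cd9660 clamp mentioned above (a sketch only:
CD9660_MAXBSIZE is the name I used above for whatever the fs's own
limit turns out to be called, its value here is a guess, and the
function and parameter names are made up; MIN() and MAXBSIZE are from
<sys/param.h>):

X 	#include <sys/param.h>
X 
X 	#define	CD9660_MAXBSIZE	(64 * 1024)	/* hypothetical fs limit */
X 
X 	/*
X 	 * Reject logical block sizes that either the fs format or the
X 	 * running kernel's buffer cache cannot handle.
X 	 */
X 	static int
X 	cd9660_bsize_ok(unsigned long logical_block_size)
X 	{
X 
X 		return (logical_block_size <= MIN(CD9660_MAXBSIZE, MAXBSIZE));
X 	}

The same shape would do for msdosfs and ffs, with their own limits (the
FFS_MAXBSIZE spelling suggested above, or plain 64K) in place of
CD9660_MAXBSIZE.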