From owner-freebsd-arch@FreeBSD.ORG  Sun Jul  5 14:11:55 2009
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 48E6F1065670;
	Sun,  5 Jul 2009 14:11:55 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail08.syd.optusnet.com.au (mail08.syd.optusnet.com.au
	[211.29.132.189])
	by mx1.freebsd.org (Postfix) with ESMTP id D7B688FC16;
	Sun,  5 Jul 2009 14:11:54 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from c122-106-161-96.carlnfd1.nsw.optusnet.com.au
	(c122-106-161-96.carlnfd1.nsw.optusnet.com.au [122.106.161.96])
	by mail08.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	n65EBp4B013112
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Mon, 6 Jul 2009 00:11:52 +1000
Date: Mon, 6 Jul 2009 00:11:51 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@delplex.bde.org
To: Alexander Motin <mav@FreeBSD.org>
In-Reply-To: <4A50667F.7080608@FreeBSD.org>
Message-ID: <20090705223126.I42918@delplex.bde.org>
References: <4A4FAA2D.3020409@FreeBSD.org>
	<20090705100044.4053e2f9@ernst.jennejohn.org>
	<4A50667F.7080608@FreeBSD.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: gary.jennejohn@freenet.de, freebsd-arch@FreeBSD.org
Subject: Re: DFLTPHYS vs MAXPHYS
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 05 Jul 2009 14:11:55 -0000

On Sun, 5 Jul 2009, Alexander Motin wrote:

> Gary Jennejohn wrote:
>> On Sat, 04 Jul 2009 22:14:53 +0300
>> Alexander Motin <mav@FreeBSD.org> wrote:
>> 
>>> Can somebody explain to me the difference between the DFLTPHYS and MAXPHYS 
>>> constants? As I understand it, the latter is the maximum amount of memory 
>>> that can be mapped to the kernel, or passed to the hardware drivers. But 
>>> why then is DFLTPHYS used in so many places and what does it mean?
>> 
>> There's a pretty good comment on these in /sys/conf/NOTES.
>
> But it does not explain why.

DFLTPHYS is the default -- the size to be used when the correct size is
not known.  However, this is mostly broken:

- the correct size should always be known at a low level.  You have to
   know the maximum size for a device to know that this size is larger
   than the default, else using the default size won't work.  Also, you
   have to know that the default size is a multiple of the minimum size.
   Both of these are usually true accidentally, so things sort of work.

- the default size is defaulted inconsistently.  Geom hides the device
   maximum i/o size (d_maxsize, which is normally either 64K or DFLTPHYS,
   which happen to be the same) from the top level of devices (it reblocks
   if necessary so that sizes up to si_iosize_max, which is always
   MAXPHYS, work), so it is difficult to see the low-level size or to
   use an i/o size that is a multiple of the device maximum i/o size if
   the latter is not a divisor of MAXPHYS.  This means that hard-coding
   MAXPHYS would work best in most places above the driver level, but most
   places have a mess of buggy layering (mnt_iosize_max is supposed to
   default to DFLTPHYS and then be changed to si_iosize_max when the latter
   is known, but some file systems forget to do this).
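
The two conditions in the first point, and the reblocking in the second,
can be sketched roughly as follows.  This is only an illustration, not
kernel code: the SKETCH_* values and both helper names are made up for
the example, with 64K and 128K assumed for the two constants.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration only; a real kernel defines its own. */
#define SKETCH_DFLTPHYS	(64 * 1024)
#define SKETCH_MAXPHYS	(128 * 1024)

/*
 * Hypothetical check: nonzero if `size` is a usable transfer size for a
 * device whose smallest unit is `dev_min` bytes and whose largest single
 * transfer is `dev_max` bytes.
 */
static int
iosize_ok(size_t size, size_t dev_min, size_t dev_max)
{
	assert(dev_min > 0);
	return (size > 0 && size <= dev_max && size % dev_min == 0);
}

/*
 * Hypothetical helper: number of device-level transfers needed to cover
 * a request of `len` bytes when each transfer is at most `dev_max` bytes.
 * The last transfer is short whenever dev_max does not divide len.
 */
static size_t
reblock_count(size_t len, size_t dev_max)
{
	assert(dev_max > 0);
	return ((len + dev_max - 1) / dev_max);
}
```

With these assumed values, a device limited to 32K fails iosize_ok() for
the default size, and a device limited to 48K needs 3 transfers (48K,
48K, 32K) for one MAXPHYS-sized request since 48K is not a divisor of
128K.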

>>> Isn't it time to review their values with a view to increasing them? 
>>> 64KB looks funny compared to modern memory sizes and data rates. It just 
>>> increases interrupt rates, but I don't think it really needs to be so 
>>> small to improve interactivity now.

64K is large enough to bust modern L1 caches and old L2 caches.  Make the
size bigger to bust modern L2 caches too.  Interrupt rates don't matter
when you are transferring 64K items per interrupt.
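
A rough worked example of the interrupt-rate point, assuming one
interrupt per transfer and a sustained rate of 100MB/s (both are
assumptions made for the arithmetic, and the helper name is
hypothetical):

```c
#include <assert.h>

/*
 * Hypothetical helper: interrupts per second for a given sustained data
 * rate, assuming exactly one interrupt per completed transfer.
 */
static unsigned long
interrupts_per_second(unsigned long bytes_per_second,
    unsigned long bytes_per_interrupt)
{
	assert(bytes_per_interrupt > 0);
	return (bytes_per_second / bytes_per_interrupt);
}
```

At 100MB/s (taken as 100 * 1024 * 1024 bytes/s) this gives 1600
interrupts per second for 64K transfers and 800 for 128K transfers.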

>> I wonder whether all drivers can correctly handle larger values for
>> DFLTPHYS.

Most can't, since their hardware can't.  They can fake it (ata used to)
but there is negative point in this for most drivers, since geom already
reblocks for disk devices and reblocking would be wrong for devices like
tapes.

> There will always be drivers/devices with limitations. They should just 
> be able to report those limitations to the system. This is possible with 
> GEOM, but it doesn't look well tuned for all providers. There are many 
> places where DFLTPHYS is used just in the hope that it will work. IMHO if 
> a driver is unable to adapt to any defined DFLTPHYS value, it should not 
> use it, but instead should announce some specific value that it really 
> supports.

cam scsi devices seem to be the only important ones that still hard-code
d_maxsize to DFLTPHYS.  Strangely, pre-cam scsi had the beginnings (or
remnants) of more sophisticated i/o size limiting.  FreeBSD-1 has an
xxminphys() function for every scsi device.  I think it was supposed
to be possible to ask any device for any i/o size, and minphys was used
for reblocking at a low level.  minphys was only implemented for scsi
drivers and wasn't part of physio() as in Net/2 (?).  For the aha1542
driver, minphys was:

% void 
% ahaminphys(bp)
% 	struct buf *bp;
% {
% /*      aha seems to explode with 17 segs (64k may require 17 segs) */
% /*      on old boards so use a max of 16 segs if you have problems here */
% 	if (bp->b_bcount > ((AHA_NSEG - 1) * PAGESIZ)) {
% 		bp->b_bcount = ((AHA_NSEG - 1) * PAGESIZ);
% 	}
% }
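
The "64k may require 17 segs" remark in the comment is just page
arithmetic, which can be sketched as below.  The values 4096 for PAGESIZ
and 17 for AHA_NSEG are assumptions chosen to match the comment, not
taken from the driver, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGESIZ	4096	/* assumed page size */
#define SKETCH_AHA_NSEG	17	/* assumed segment limit */

/*
 * Number of pages (one scatter/gather segment each) touched by a buffer
 * of `len` bytes that starts `offset_in_page` bytes into its first page.
 */
static size_t
pages_spanned(size_t offset_in_page, size_t len)
{
	assert(offset_in_page < SKETCH_PAGESIZ);
	if (len == 0)
		return (0);
	return ((offset_in_page + len + SKETCH_PAGESIZ - 1) / SKETCH_PAGESIZ);
}

/* Same clamp as ahaminphys(), applied to a plain byte count. */
static size_t
aha_clamp(size_t bcount)
{
	size_t limit = (SKETCH_AHA_NSEG - 1) * SKETCH_PAGESIZ;

	return (bcount > limit ? limit : bcount);
}
```

Under these assumptions a page-aligned 64K buffer spans 16 pages, while
the same buffer starting 512 bytes into a page spans 17.  Clamping to
(AHA_NSEG - 1) pages keeps any buffer within AHA_NSEG segments whatever
its alignment.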

FreeBSD-1 doesn't have DFLTPHYS, and barely uses MAXPHYS.  MAXPHYS was 64K.
I think MAXBSIZE = 64K limited most transfers.  However, physio() uses a
buffer of size 256K, larger than it does today!  So apparently device
drivers were responsible for lots of reblocking.  In the wd driver, the
reblocking consisted of doing one 512-byte block at a time (I think it
didn't even do multiple sectors per interrupt then).

Bruce