From owner-freebsd-questions Wed Mar 13 09:59:28 1996
Return-Path: owner-questions
Received: (from root@localhost) by freefall.freebsd.org (8.7.3/8.7.3)
          id JAA13170 for questions-outgoing; Wed, 13 Mar 1996 09:59:28 -0800 (PST)
Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211])
          by freefall.freebsd.org (8.7.3/8.7.3) with SMTP id JAA13132;
          Wed, 13 Mar 1996 09:59:19 -0800 (PST)
Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9)
          id KAA08607; Wed, 13 Mar 1996 10:52:40 -0700
From: Terry Lambert
Message-Id: <199603131752.KAA08607@phaeton.artisoft.com>
Subject: Re: non-blocking read ?
To: msmith@atrad.adelaide.edu.au (Michael Smith)
Date: Wed, 13 Mar 1996 10:52:39 -0700 (MST)
Cc: luigi@labinfo.iet.unipi.it, leisner@sdsp.mc.xerox.com,
    msmith@atrad.adelaide.edu.au, questions@freebsd.org, current@freebsd.org
In-Reply-To: <199603122357.KAA00112@genesis.atrad.adelaide.edu.au> from
    "Michael Smith" at Mar 13, 96 10:27:08 am
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-questions@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

> > Someone suggested doing async IO and handling SIGIO (I suppose this
> > refers to doing
>
> See my previous message regarding not being sure about this, and
> definitely check how harvest does it; I'm sure I was wrong earlier 8(
>
> > But how much data will be available when I get the SIGIO (or select
> > will return that I/O is possible)?  The amount I requested (assuming
> > it is available), or the system's idea of a block, or what?
>
> "some"; you call read and examine the return value to see how much
> you get.
>
> Thinking about it further, I don't see how this would work for disk
> I/O; it's not until the read itself is issued that the disk request
> is queued ...

You get the SIGIO when there is data pending, not when the read has
completed (man fcntl, look for O_ASYNC).

All it is is a "data pending" notification -- it is a hack around
non-I/O-based messaging mechanisms so that they may be used (you catch
the SIGIO and then hit select) to let you multiplex, for instance,
System V message queues.

Unfortunately, it's not always possible to distinguish signal events
from their causes: if I have multiple completions in the time it takes
me to handle one signal, I'm screwed, even if SIGIO passed, say, the
address of the completed buffer to the handler.

Signals are persistent conditions.  Signals are *not* events.  There is
no "wait" equivalent call family for SIGIO, like there is for SIGCLD.
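For concreteness, the O_ASYNC/SIGIO arrangement above looks roughly
like the sketch below from user space.  This is illustrative only (the
names are mine), and it assumes fd is a socket or tty descriptor; SIGIO
is generally not delivered for reads on regular files, which is part of
the disk I/O problem raised above.

/*
 * Illustrative sketch of the SIGIO/O_ASYNC "data pending" pattern.
 */
#include <sys/types.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t io_pending;

static void
sigio_handler(int sig)
{
        /*
         * The signal carries no buffer address and no byte count, and
         * several completions may coalesce into a single delivery; all
         * the handler can do is note that *something* is pending.
         */
        io_pending = 1;
}

static void
setup_async_io(int fd)
{
        int flags;

        signal(SIGIO, sigio_handler);

        /* Direct SIGIO for this descriptor to this process. */
        fcntl(fd, F_SETOWN, getpid());

        /*
         * O_ASYNC: raise SIGIO when I/O becomes possible.  O_NONBLOCK
         * so the drain loop below never sleeps on a coalesced or
         * spurious signal.
         */
        flags = fcntl(fd, F_GETFL, 0);
        fcntl(fd, F_SETFL, flags | O_ASYNC | O_NONBLOCK);
}

static void
drain(int fd)
{
        char buf[4096];
        ssize_t n;

        /* "some"; you call read and examine the return value. */
        while ((n = read(fd, buf, sizeof buf)) > 0) {
                /* consume n bytes of buf here */
        }
        if (n == -1 && errno != EAGAIN) {
                /* hard error on the descriptor */
        }
}

Note that the handler learns nothing about which request finished or
how much data arrived; the main loop still has to go back and read (or
select) to find out, which is the coalescing problem described above.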
It would be best if you dealt with the kernel reentrancy issues and
implemented aioread/aiowrite/aiowait/aiocancel.  This was a lot easier
under 1.1.5.1 (where I fully implemented a SunOS LWP clone library at
one time), because the lock recursion on read/write reentrancy wasn't
overly complex, like it is in -current with the unified VM and the
vnode/inode dissociation code (which I still think is broken).

To handle this, you would need to:

1)  Move VM to using device/extent instead of vnode/extent for the
    buffer cache.  This would:

    o   Allow reuse of a vnode without discarding the buffer cache
        entry or needing an ihash entry as a second-chance cache.

    o   Allow you to get around the IN_RECURSE and lock complexity in
        the VOP_LOCK/VOP_UNLOCK code, which currently affects both the
        vnode and the underlying FS.

    This would limit device size from 8TB down to 1TB.  File size
    limitations would not change.

2)  With the VM change in place, you would need to change the VOP_LOCK
    code.  Specifically, you would need to define counting semaphore
    routines, probably called vn_lock and vn_unlock.  These routines
    would acquire the vnode lock (allowing recursion for the same PID
    in all cases) and then call the underlying FS's VOP_LOCK/VOP_UNLOCK,
    respectively.  If the VOP_LOCK call failed, the vn_lock would be
    released and the call would fail.  This allows for an FS-specific
    "veto", and is there solely to support FS layering for union,
    translucent, loopback, and overlay FS's (an overlay FS would be,
    for example, a umsdos-on-dos, vfat-on-dos, or quota-on-any type FS).

    The lock code changes would allow a process to have multiple
    outstanding read/write requests, as long as global structure
    modifications were semaphored.  This is the first step toward
    kernel reentrancy, allowing kernel multithreading or kernel
    preemption (necessary for POSIX RT processing); it is also the
    first step (with conversion of the semaphore to a mutex or, better,
    a hierarchical lock with a mutex "top") toward supporting
    multiple-processor reentrancy for the VFS kernel subsystem.

3)  With the lock code changed, a single multiplexed system call should
    be designated as "aio".  Yes, it's possible to use four call slots
    instead, but why?  This system call would use stub wrappers to pass
    down alternate argument list selectors, and would provide an
    aioread/aiowrite/aiowait/aiocancel mechanism.

    An aioread or aiowrite needs to be handled in the ordinary read or
    write path; when a blocking operation is issued, it needs to pass a
    completion routine and an argument address.  The completion routine
    is the same for all processes; the argument address is the address
    of a context structure, which points to the proc structure for the
    process that issued the request, as well as context information for
    the actual copyout (buffer length, buffer address for copyout,
    etc.).

    When an I/O completes, it needs to be unlinked from the "pending"
    list and linked onto the "completed" list.  These are two new
    pointers hung off the proc structure.  The aiowait/aiocancel
    operations operate on the context identifiers on these lists, with
    obvious results.

A more generic mechanism would be to convert the aio multiplex call
into a call gate instead.  This would allow the user to issue *any*
system call as an async operation.  A flag would be added to the sysent
structure to mark calls that are allowed to return to user space
pending completion; by default the flag would be 0, but it would be set
to 1 for read and write in the first rev of the code.  Any operation
that could require paging to be satisfied should, in fact, be capable
of being issued asynchronously.

A "middle ground" implementation would make the multiplex system call
something like "aiosyscall" -- that is, the same idea as "syscall", the
multiplex entry point that lets you call any existing system call by
its syscall.h manifest constant.

This would not be a horrific amount of work, but it would require
dragging several of the kernel people into the process (well, not
really, but you'd need their approval to commit the changes).


                                        Regards,
                                        Terry Lambert
                                        terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
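As a rough illustration of the bookkeeping described in step 3, the
context structure and the two lists hung off the proc structure might
look something like the sketch below.  These names and structures are
hypothetical -- this is not existing (or proposed) kernel source, just
the description above restated in C.

#include <sys/types.h>

struct proc;                            /* forward declaration only */

/*
 * One context per outstanding aioread/aiowrite; its address is the
 * argument handed to the common completion routine.
 */
struct aio_context {
        struct aio_context *ac_next;    /* link on pending/completed list */
        struct proc        *ac_proc;    /* process that issued the request */
        void               *ac_ubuf;    /* user buffer address for copyout */
        size_t              ac_ublen;   /* user buffer length */
        ssize_t             ac_result;  /* byte count or error when done */
        int                 ac_id;      /* identifier aiowait/aiocancel use */
};

/*
 * The "two new pointers hung off the proc structure": requests issued
 * but not yet complete, and requests complete but not yet collected.
 */
struct aio_perproc {
        struct aio_context *pa_pending;
        struct aio_context *pa_completed;
};

/*
 * Completion routine, the same for all processes: record the result,
 * move the context from the pending list to the completed list, and
 * wake any sleeper in aiowait (unlinking and wakeup are elided here).
 */
static void
aio_complete(struct aio_perproc *pa, struct aio_context *ac, ssize_t result)
{
        ac->ac_result = result;

        /* ... unlink ac from pa->pa_pending (elided) ... */

        ac->ac_next = pa->pa_completed; /* push onto the completed list */
        pa->pa_completed = ac;
}

In this picture, aiowait sleeps until pa_completed is non-empty and
hands back an identifier from that list, and aiocancel searches both
lists by identifier, with the obvious results.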