From: Terry Lambert <tlambert2@mindspring.com>
Date: Mon, 05 Aug 2002 19:31:27 -0700
To: Lamont Granquist
Cc: "Justin T. Gibbs", Zhihui Zhang, freebsd-hackers@FreeBSD.ORG, freebsd-scsi@FreeBSD.ORG
Subject: Re: transaction ordering in SCSI subsystem
Message-ID: <3D4F34FF.21862712@mindspring.com>

Lamont Granquist wrote:
> So it sounds like CAM has two features which aid in preserving data
> integrity.  First it serializes operations with the same tag and
> second it implements stall barriers in those pipelines?

No.  It merely maintains the order of requests made to it when it makes
requests to underlying layers.  Stalls are either the result of hardware
limitations (on the bottom end) or semantic guarantees by the VFS (on
the top end).

It is up to the VFS above whether it wants to make additional requests
to CAM, once there are requests outstanding.
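As a toy model of that behavior (this is not the real CAM code; every name here is hypothetical), a mid-layer that preserves submission order and merely propagates backpressure might look like this: it forwards requests strictly in the order it receives them, and when the capacity-limited layer below defers a request, the mid-layer reports that upward rather than absorbing or reordering anything.

```c
/*
 * Toy model (not the real CAM code): a mid-layer that preserves the
 * submission order of requests and propagates backpressure from a
 * capacity-limited lower layer.  All names are hypothetical.
 */
#include <assert.h>

#define LOWER_TAGS	2	/* stand-in for the hardware tag limit */

static int lower_inflight;	/* requests the "hardware" is holding */
static int lower_log[16];	/* order in which requests arrived below */
static int lower_count;

/* The lower layer accepts a request or defers it; it never reorders. */
static int
lower_submit(int id)
{
	if (lower_inflight >= LOWER_TAGS)
		return (-1);	/* hardware is full: defer */
	lower_log[lower_count++] = id;
	lower_inflight++;
	return (0);
}

static void
lower_complete(void)
{
	assert(lower_inflight > 0);
	lower_inflight--;
}

/*
 * The mid-layer ("CAM" in this sketch) passes requests down in arrival
 * order; a deferral from below is propagated to the caller, not hidden.
 */
static int
mid_submit(int id)
{
	return (lower_submit(id));
}
```

In this sketch the stall originates below (the tag limit) or above (the caller choosing to wait); the mid-layer itself imposes neither.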
It is up to the hardware below whether it will accept or defer
additional requests from CAM.  CAM only limits requests from above when
it is itself limited by the hardware below.  It's more correct to say
that it *propagates* pipeline stalls.

The distinction is important here.  The main performance issue that you
face as a result of a stall barrier is the propagation delay down to,
and back up from, the physical hardware.  Because CAM permits concurrent
operations up to the limits of the hardware, then as long as you don't
over-drive the hardware, it adds latency on individual transactions but
does not itself limit throughput in the normal case: any bottleneck you
have is a result of the underlying hardware, or of the explicit stalls
caused by the code above CAM voluntarily avoiding queueing new requests
until a previous request has been successfully completed.

It's a queue pool retention problem, just like a sliding window protocol
such as TCP/IP: for N requests, I end up with a single round trip of
latency due to the introduced propagation delay.  N can be arbitrarily
large, up to the limits of the hardware.  If you hit those limits, then
you start eating a latency per request, as requests become turnstiled by
the available tags for tagged commands, etc.

> Is there a good SCSI reference out there for someone interested more in
> the features of CAM and the upper layers of the protocol?

The source code?  Justin?

I'm sure people would appreciate it if someone would document this code.
Right now, most people get to learn about it by writing a provider (like
a SCSI driver) or a consumer (kernel block I/O code) of CAM.  Trial by
fire is not the best way to learn things (IMO), since you only learn
about the main road and a few branches, rather than the shape and size
of the city.

> Also, at a higher level in the VFS layer, which operations are tagged
> the same?
> Are operations on the same vnode tagged the same, and then write
> barriers are introduced appropriately between data and metadata writes?
> Are operations on different vnodes always different tags?  And how is
> the consistency of something like the block allocation table maintained?

Operations are not explicitly tagged on a per-system-call basis, if
that's what you are asking.  For your other questions, there are really
two different issues:

1)	POSIX semantics.

POSIX semantics dictate which operations must be committed to stable
storage before other operations are permitted to proceed.  POSIX
therefore implies some soft barrier points, and some hard barrier points
(e.g. fsync(2)).  These are normally handled by the FS, not by the
system call layer (e.g. you might have NVRAM to which metadata changes
are logged by the FS, so that it can return immediately; so it's really
an FS thing).

2)	File system implementation details.

Each FS implementation has internal consistency guarantees that differ
based on the implementation.  For example, contrast directory management
in MSDOSFS and FFS: in FFS, a directory entry is just a *reference* to
an inode (hence the ability to support hard links), while in MSDOSFS,
the directory entry has both the name and the inode metadata, as one
unit (a FAT entry -- hence the inability to support hard links).  What
this means is that in FFS there are two operations that have an explicit
ordering relationship, whereas in MSDOSFS there is just one operation.
This also explains why ordering can't be implemented at the upper level,
in the system call layer.

The upshot of this is that consistency is maintained by semantics
specific to each FS implementation, which deal with that FS' decision
whether or not to implement a stall barrier for a particular operation
(e.g. "lock an inode; cause subsequent requests to sleep until the
current outstanding request is satisfied; unlock the inode" ...
thus causing operations to be ordered: inodes are per-FS objects, so any
stalling by waiting for operations to be acknowledged, rather than
simply blindly queueing them, has to be in FS-specific code).

The reality is even more complicated; I've hidden a number of layers,
and I generalized quite a bit (I'm surprised no one has complained yet).
For example, between the VFS and CAM sit the VM and a block I/O system
based on page-based I/O management, with specific exceptions allowed for
physical disk block access on a block-by-block basis for directory
entries.  So when you do a write of 35 bytes, you may end up having to
do a lot more than you think: e.g. page in the page containing the 35
bytes (or two pages, if the write spans a page boundary); modify the 35
bytes with a copy from the user space buffer to the page hung off the
vm_object_t of the vnode for the file, which is pointed to by the
per-process open file table pointed to by the proc structure; mark the
page as dirty; sleep pending its being written out, if what we are
dealing with is a metadata change; etc.

-- Terry
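To put a number on the sliding-window point earlier in this message, here is a deliberately crude model (my own simplification, with service time ignored): with round-trip time R and a window of W outstanding requests, N requests cost roughly ceil(N/W) round trips, so a fully serialized caller (W = 1) pays N*R while a caller that keeps the pipeline full pays close to a single round trip.

```c
/*
 * Crude model of pipelined request cost (an assumption for
 * illustration, not a measurement of any real driver): n requests,
 * round-trip time r, at most w outstanding at once.
 */
static int
toy_time(int n, int r, int w)
{
	return (((n + w - 1) / w) * r);	/* ceil(n/w) round trips */
}
```

With n = 8 and r = 10, a window of 1 costs 80 time units, a window of 4 costs 20, and a window of 8 costs a single round trip's 10; past the hardware's tag limit, widening the window buys nothing more.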
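To make the FFS-versus-MSDOSFS contrast concrete, here is a deliberately simplified sketch (these are NOT the real on-disk formats; every struct and name is illustrative only): an FFS-style entry is just a name plus an inode number, so two entries can reference the same inode and its link count tracks them (a hard link), while a FAT-style entry carries the file metadata inline, leaving nothing separate to reference twice.

```c
/* Illustrative only: not the real FFS or FAT on-disk formats. */
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* FFS-style: the entry is a *reference*; metadata lives in the inode. */
struct toy_inode {
	uint32_t size;
	uint16_t nlink;		/* how many directory entries point here */
};

struct toy_ffs_dirent {
	char	 name[16];
	uint32_t ino;		/* reference into the inode table */
};

/* FAT-style: name and metadata are one unit; nothing to share. */
struct toy_fat_dirent {
	char	 name[16];
	uint32_t size;		/* metadata inline in the entry itself */
	uint32_t first_cluster;
};

/* A hard link is just a second FFS-style entry with the same ino. */
static void
toy_link(struct toy_ffs_dirent *de, const char *name, uint32_t ino,
    struct toy_inode *itab)
{
	strncpy(de->name, name, sizeof(de->name) - 1);
	de->name[sizeof(de->name) - 1] = '\0';
	de->ino = ino;
	itab[ino].nlink++;
}

/* Create "a", hard link "b" to the same inode; return its link count. */
static int
toy_demo(void)
{
	struct toy_inode itab[4] = {{0, 0}, {0, 0}, {0, 0}, {0, 0}};
	struct toy_ffs_dirent a, b;

	toy_link(&a, "a", 1, itab);
	toy_link(&b, "b", 1, itab);	/* second reference: a hard link */
	return (itab[1].nlink);
}
```

The ordering relationship in the FFS case falls out of the split: the inode must exist and be stable before the directory entry referencing it is written, which is the two-operation dependency the FAT-style single unit doesn't have.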