From: Terry Lambert <tlambert2@mindspring.com>
Date: Mon, 05 Aug 2002 19:31:27 -0700
To: Lamont Granquist
Cc: "Justin T. Gibbs", Zhihui Zhang, freebsd-hackers@FreeBSD.ORG, freebsd-scsi@FreeBSD.ORG
Subject: Re: transaction ordering in SCSI subsystem
Message-ID: <3D4F34FF.21862712@mindspring.com>

Lamont Granquist wrote:
> So it sounds like CAM has two features which aid in preserving data
> integrity.  First it serializes operations with the same tag and
> second it implements stall barriers in those pipelines?

No.  It merely maintains the order of requests made to it when it makes
requests to underlying layers.  Stalls are either the result of hardware
limitations (on the bottom end) or semantic guarantees by the VFS (on
the top end).

It is up to the VFS above whether it wants to make additional requests
to CAM, once there are requests outstanding.
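As a toy model of that behavior (this is not the real CAM code; every name here is hypothetical), a mid-layer that preserves submission order and merely propagates backpressure might look like this: it forwards requests strictly in the order it receives them, and when the capacity-limited layer below defers a request, the mid-layer reports that upward rather than absorbing or reordering anything.

```c
/*
 * Toy model (not the real CAM code): a mid-layer that preserves the
 * submission order of requests and propagates backpressure from a
 * capacity-limited lower layer.  All names are hypothetical.
 */
#include <assert.h>

#define LOWER_TAGS	2	/* stand-in for the hardware tag limit */

static int lower_inflight;	/* requests the "hardware" is holding */
static int lower_log[16];	/* order in which requests arrived below */
static int lower_count;

/* The lower layer accepts a request or defers it; it never reorders. */
static int
lower_submit(int id)
{
	if (lower_inflight >= LOWER_TAGS)
		return (-1);	/* hardware is full: defer */
	lower_log[lower_count++] = id;
	lower_inflight++;
	return (0);
}

static void
lower_complete(void)
{
	assert(lower_inflight > 0);
	lower_inflight--;
}

/*
 * The mid-layer ("CAM" in this sketch) passes requests down in arrival
 * order; a deferral from below is propagated to the caller, not hidden.
 */
static int
mid_submit(int id)
{
	return (lower_submit(id));
}
```

In this sketch the stall originates below (the tag limit) or above (the caller choosing to wait); the mid-layer itself imposes neither.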
It is up to the hardware below whether it will accept or defer
additional requests from CAM.  CAM only limits requests from above when
it is itself limited by the hardware below.  It's more correct to say
that it *propagates* pipeline stalls.

The distinction is important here.  The main performance issue that you
face as a result of a stall barrier is the propagation delay down to,
and back up from, the physical hardware.  Because CAM permits concurrent
operations up to the limits of the hardware, then as long as you don't
over-drive the hardware, it adds latency on individual transactions but
does not itself limit throughput in the normal case: any bottleneck you
have is a result of the underlying hardware, or of the explicit stalls
caused by the code above CAM voluntarily avoiding queueing new requests
until a previous request has been successfully completed.

It's a queue pool retention problem, just like a sliding window protocol
such as TCP/IP: for N requests, I end up with a single round trip of
latency due to the introduced propagation delay.  N can be arbitrarily
large, up to the limits of the hardware.  If you hit those limits, then
you start eating a latency per request, as requests become turnstiled by
the available tags for tagged commands, etc.

> Is there a good SCSI reference out there for someone interested more in
> the features of CAM and the upper layers of the protocol?

The source code?  Justin?

I'm sure people would appreciate it if someone would document this code.
Right now, most people get to learn about it by writing a provider (like
a SCSI driver) or a consumer (kernel block I/O code) of CAM.  Trial by
fire is not the best way to learn things (IMO), since you only learn
about the main road and a few branches, rather than the shape and size
of the city.

> Also, at a higher level in the VFS layer, which operations are tagged
> the same?
> Are operations on the same vnode tagged the same, and then write
> barriers are introduced appropriately between data and metadata writes?
> Are operations on different vnodes always different tags?  And how is
> the consistency of something like the block allocation table maintained?

Operations are not explicitly tagged on a per-system-call basis, if
that's what you are asking.  For your other questions, there are really
two different issues:

1)	POSIX semantics.

POSIX semantics dictate which operations must be committed to stable
storage before other operations are permitted to proceed.  POSIX
therefore implies some soft barrier points, and some hard barrier points
(e.g. fsync(2)).  These are normally handled by the FS, not by the
system call layer (e.g. you might have NVRAM to which metadata changes
are logged by the FS, so that it can return immediately; so it's really
an FS thing).

2)	File system implementation details.

Each FS implementation has internal consistency guarantees that differ
based on the implementation.  For example, contrast directory management
in MSDOSFS and FFS: in FFS, a directory entry is just a *reference* to
an inode (hence the ability to support hard links), while in MSDOSFS,
the directory entry has both the name and the inode metadata, as one
unit (a FAT entry -- hence the inability to support hard links).  What
this means is that in FFS there are two operations that have an explicit
ordering relationship, whereas in MSDOSFS there is just one operation.
This also explains why ordering can't be implemented at the upper level,
in the system call layer.

The upshot of this is that consistency is maintained by semantics
specific to each FS implementation, which deal with that FS' decision
whether or not to implement a stall barrier for a particular operation
(e.g. "lock an inode; cause subsequent requests to sleep until the
current outstanding request is satisfied; unlock the inode" ...
thus causing operations to be ordered: inodes are per-FS objects, so any
stalling by waiting for operations to be acknowledged, rather than
simply blindly queueing them, has to be in FS-specific code).

The reality is even more complicated; I've hidden a number of layers,
and I generalized quite a bit (I'm surprised no one has complained yet).
For example, between the VFS and CAM sit the VM and a block I/O system
based on page-based I/O management, with specific exceptions allowed for
physical disk block access on a block-by-block basis for directory
entries.  So when you do a write of 35 bytes, you may end up having to
do a lot more than you think: e.g. page in the page containing the 35
bytes (or two pages, if the write spans a page boundary); modify the 35
bytes with a copy from the user space buffer to the page hung off the
vm_object_t of the vnode for the file, which is pointed to by the
per-process open file table pointed to by the proc structure; mark the
page as dirty; sleep pending its being written out, if what we are
dealing with is a metadata change; etc.

-- Terry
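To put a number on the sliding-window point earlier in this message, here is a deliberately crude model (my own simplification, with service time ignored): with round-trip time R and a window of W outstanding requests, N requests cost roughly ceil(N/W) round trips, so a fully serialized caller (W = 1) pays N*R while a caller that keeps the pipeline full pays close to a single round trip.

```c
/*
 * Crude model of pipelined request cost (an assumption for
 * illustration, not a measurement of any real driver): n requests,
 * round-trip time r, at most w outstanding at once.
 */
static int
toy_time(int n, int r, int w)
{
	return (((n + w - 1) / w) * r);	/* ceil(n/w) round trips */
}
```

With n = 8 and r = 10, a window of 1 costs 80 time units, a window of 4 costs 20, and a window of 8 costs a single round trip's 10; past the hardware's tag limit, widening the window buys nothing more.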
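To make the FFS-versus-MSDOSFS contrast concrete, here is a deliberately simplified sketch (these are NOT the real on-disk formats; every struct and name is illustrative only): an FFS-style entry is just a name plus an inode number, so two entries can reference the same inode and its link count tracks them (a hard link), while a FAT-style entry carries the file metadata inline, leaving nothing separate to reference twice.

```c
/* Illustrative only: not the real FFS or FAT on-disk formats. */
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* FFS-style: the entry is a *reference*; metadata lives in the inode. */
struct toy_inode {
	uint32_t size;
	uint16_t nlink;		/* how many directory entries point here */
};

struct toy_ffs_dirent {
	char	 name[16];
	uint32_t ino;		/* reference into the inode table */
};

/* FAT-style: name and metadata are one unit; nothing to share. */
struct toy_fat_dirent {
	char	 name[16];
	uint32_t size;		/* metadata inline in the entry itself */
	uint32_t first_cluster;
};

/* A hard link is just a second FFS-style entry with the same ino. */
static void
toy_link(struct toy_ffs_dirent *de, const char *name, uint32_t ino,
    struct toy_inode *itab)
{
	strncpy(de->name, name, sizeof(de->name) - 1);
	de->name[sizeof(de->name) - 1] = '\0';
	de->ino = ino;
	itab[ino].nlink++;
}

/* Create "a", hard link "b" to the same inode; return its link count. */
static int
toy_demo(void)
{
	struct toy_inode itab[4] = {{0, 0}, {0, 0}, {0, 0}, {0, 0}};
	struct toy_ffs_dirent a, b;

	toy_link(&a, "a", 1, itab);
	toy_link(&b, "b", 1, itab);	/* second reference: a hard link */
	return (itab[1].nlink);
}
```

The ordering relationship in the FFS case falls out of the split: the inode must exist and be stable before the directory entry referencing it is written, which is the two-operation dependency the FAT-style single unit doesn't have.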