From: Terry Lambert
Date: Sun, 04 Aug 2002 23:41:17 -0700
To: Lamont Granquist
Cc: "Justin T. Gibbs", Zhihui Zhang, freebsd-hackers@FreeBSD.ORG,
    freebsd-scsi@FreeBSD.ORG
Subject: Re: transaction ordering in SCSI subsystem
Message-ID: <3D4E1E0D.582EBE7C@mindspring.com>
References: <20020804223605.X892-100000@coredump.scriptkiddie.org>

Lamont Granquist wrote:
> So what exactly gets ordered and how do things get tagged?
>
> I tried following this in the code from VOP_STRATEGY and never quite
> figured it out.  Basically when you do a write are you just tagging the
> data writes along with the metadata writes and then sequencing them so
> that they have to complete in a given order?  And can operations with
> different tags be mixed around randomly?
>
> Also, how does the feedback from the SCSI controller that the write
> completed get used by the O/S

Requests are issued to CAM.
CAM issues requests with tags to SCSI controller.
SCSI controller issues commands to target on SCSI bus using a tag.
Target completes command, issues "completed" on tag.
SCSI controller writes status to memory for request struct.
SCSI controller issues interrupt.
ISR in SCSI driver runs, and notes completed request.
ISR notifies CAM.

Operations on tags may be concurrently outstanding.  There are a
limited number of concurrent operations permitted to be outstanding,
as dictated by the number of tags supported by a physical disk drive.

Operations which can occur concurrently are requested concurrently;
the order in which they complete does not matter.

Operations which can *not* occur concurrently are requested only
serially.  This serialization is called a "stall barrier": the next
operation is not attempted until the previous operation has been
committed to stable storage.

Operations at the CAM layer are proxied transactions; as Justin
stated, operations queued to CAM are guaranteed to be queued to the
underlying physical device in the same order.

The FS is responsible for introducing stall barriers, as necessary,
to ensure metadata integrity.  If the FS guarantees user data
integrity as well, then it must introduce stall barriers for that,
too.

The minimal requirement for end-to-end data integrity is for the
operating system to guarantee metadata integrity (transactional
idempotence of operations, in order to guarantee atomicity), and for
the application to provide user data integrity through proper use of
metadata operation ordering, in order to implement user data
transactioning.  Usually, this includes explicit data synchronization
to disk using fsync(2) calls, if user data integrity is required.
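As a rough illustration of what explicit synchronization looks like
from the application side, here is a minimal sketch in Python.  The
function name, the file names, and the "offset length" index format
are all made up for the example; the only real calls are the standard
os.fsync() wrapper around fsync(2) and ordinary file I/O.  The point
is only the ordering: the record is forced out before anything that
refers to it is written.

```python
import os
import tempfile


def append_record(data_path, index_path, payload):
    """Append payload to the data file, sync it, then write its index entry.

    Hypothetical helper: the index entry is written only after the data
    it points at has been handed to fsync(2).
    """
    with open(data_path, "ab") as data:
        offset = data.seek(0, os.SEEK_END)   # where this record will start
        data.write(payload)
        data.flush()                         # Python buffer -> kernel
        os.fsync(data.fileno())              # kernel -> drive (if the drive is honest)
    with open(index_path, "ab") as index:
        index.write(f"{offset} {len(payload)}\n".encode())
        index.flush()
        os.fsync(index.fileno())
    return offset


# Usage with throwaway files in a temporary directory.
workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "records.dat")
index_path = os.path.join(workdir, "records.idx")
first = append_record(data_path, index_path, b"hello")
second = append_record(data_path, index_path, b"world!")
```

If the machine dies between the two fsync calls, the worst case is a
record with no index entry, which is recoverable; the reverse order
could leave an index entry pointing at data that never reached disk.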
In most cases, user data integrity is implied; if, on the other hand,
you have separate files for data record indexing and data record
storage, you must provide for explicit synchronization, because you
are implying application metadata within user data regions of files,
in order to provide services on top of the OS platform which the OS
platform itself does not provide.

There are several ways for an FS to ensure metadata integrity.

The easiest to implement is synchronous metadata operations.  This
implies a stall barrier after each metadata operation, prohibiting
subsequent metadata operations until the single outstanding operation
permitted by the FS is committed to stable storage.  In this way,
metadata operation ordering is assured.

The second easiest to implement is ordered metadata operations.  This
is accomplished by dividing metadata operations into sets of
"dependent" and "independent" operations.  Operations which are
"independent" are permitted to occur concurrently.  Operations which
are "dependent" imply a stall barrier.  This method is formally called
"Delayed Order Writes", or "DOW".  There are two USL patents on this
(both assigned to Novell).  For this reason, if you want to sell your
FS in the U.S., you will not use this approach.

The third method is much more difficult to implement, since it
requires an understanding of graph theory.  It's called "soft updates"
(sometimes it's called "soft dependencies") and was invented by
Gregory Ganger and Yale Patt.  Operations are registered in dependency
order into a graph, and stall barriers are only introduced on
non-commutative edge traversals.  This ends up introducing far fewer
stall barriers, overall.  In addition, operations which roll forward
then backward (e.g. access timestamp updates on intermediate object
files which are deleted as part of a compilation process) are never
committed to disk; thus only permanent changes end up committed, so
long as the operations occur within the update clock time window.
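To make the "dependent" versus "independent" distinction concrete,
here is a toy model in Python.  It is not the FreeBSD soft updates
code, and it ignores roll-forward/roll-back entirely; the operation
names and the issue_in_waves() helper are invented for the example.
It only shows that once the dependencies are written down as a graph,
a barrier is needed between groups of operations and not after every
single one.  It uses graphlib from the standard library (Python 3.9
or later).

```python
from graphlib import TopologicalSorter


def issue_in_waves(deps):
    """Group operations into waves separated by barriers.

    deps maps each operation to the set of operations that must reach
    stable storage before it.  Operations within one wave have no
    ordering constraint among themselves and could be outstanding
    concurrently; a barrier is needed only between waves.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())   # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)                  # pretend the whole wave completed
    return waves


# Hypothetical metadata operations: the directory entry must not reach
# disk before the inode it names.
deps = {
    "write_inode": set(),
    "write_bitmap": set(),
    "update_atime": set(),
    "write_dirent": {"write_inode"},
}
waves = issue_in_waves(deps)
```

Here the three independent writes land in the first wave and the
directory entry in the second, so there is one barrier instead of the
three that fully synchronous metadata updates would need.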
If an operation does occur that requires a stall barrier, then a
stall barrier is introduced.

While it's technically possible to export a transactioning interface
to user space programs for all three of these approaches, in practice
it is difficult to implement properly.  The easiest approach is to
simply extend the graph edge in the soft updates case.  This has the
additional benefit, in a stacking vnode architecture, of avoiding the
stall barriers that are normally introduced between VFS layers, unless
there are real dependencies (i.e. the VFS/VFS boundary will normally
introduce an artificial stall barrier).  For this to be done in
FreeBSD would require generalizing the soft updates dependency graph
relationship code, to permit registration of node/node edge dependency
resolvers (which are explicit in the current soft updates
implementation).

So the answer to your question is that metadata writes and data
writes are treated separately, and you must write code in your
application to deal with user data, rather than relying on the OS to
do it for you.

For more information on how to deal with this, take a 300 level
database class at your local university and/or do a search on the
phrase "two stage commit".

> (and the corollary being how does IDE write
> caching lying about completion affect the O/S and the data integrity)?

If a drive lies about having committed data to stable storage (it
doesn't matter if it's an IDE drive or a SCSI drive, but IDE drives
tend to be chronic liars), then it causes the SCSI controller to lie
to CAM.  When the SCSI controller lies to CAM, it causes CAM to lie to
the VFS.  When CAM lies to the VFS, the VFS lies about metadata
integrity guarantees, and lies about user data having been committed
to stable storage before the fsync(2) call returns.  After which the
kernel lies to the application program, and the application program
lies to the human running it.
Moral: do not buy hardware which lies to you, unless you want to have
your software lie to you.

-- Terry