From owner-freebsd-hackers Wed Feb 12 10:26:11 1997
Return-Path:
Received: (from root@localhost) by freefall.freebsd.org (8.8.5/8.8.5) id KAA17898 for hackers-outgoing; Wed, 12 Feb 1997 10:26:11 -0800 (PST)
Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211]) by freefall.freebsd.org (8.8.5/8.8.5) with SMTP id KAA17891 for ; Wed, 12 Feb 1997 10:26:06 -0800 (PST)
Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9) id LAA00856; Wed, 12 Feb 1997 11:20:31 -0700
From: Terry Lambert
Message-Id: <199702121820.LAA00856@phaeton.artisoft.com>
Subject: Re: Raw I/O Question
To: Shimon@i-Connect.Net (Simon Shapiro)
Date: Wed, 12 Feb 1997 11:20:31 -0700 (MST)
Cc: terry@lambert.org, freebsd-hackers@freebsd.org
In-Reply-To: from "Simon Shapiro" at Feb 11, 97 10:49:46 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

> > > Can someone take a moment and describe briefly the execution path
> > > of a lseek/read/write system call to a raw (character) SCSI
> > > partition?
> >
> > You skipped a specification step: the FS layout on that partition.
> > I will assume FFS with 8k block size (the default).
>
> I skipped nothing :-)  There is NO file system on the partition.
> Just a simple file (partitions are files: not in a file system,
> but files.  Right? :-)

So you are writing a user-space FS to use the partition.

My previous posting referred only to FS-formatted block devices; I
interpreted "raw" to mean something other than "raw device" because
when I heard hoofbeats, I thought "horses".  Do you work for Oracle?
8-).

I think the problem comes down to how you handle your commits, exactly
what your on-disk structure looks like, and exactly what you plan to do
to your device driver to support your user-space FS.

The "lseek" is basically the same, but the "read" and the "write" are
not.  They go through the struct fileops and into the ops defined by
specfs for character devices, and from there directly to the strategy
routines through the cdevsw reference.

I profoundly believe you should not use character devices for this.  I
also believe the FS should be in the kernel, not in user space, to
avoid unnecessary protection domain crossings and the resulting context
switching they will cause.

We can squeeze some additional code path out of there by eliminating
struct fileops; it's not that hard to do.  It's the result of a partial
integration of vnode ops into the VFS framework.  (The VFS framework
was a rush job -- USL attempted to cripple the ability to make a
bootable OS by surgically claiming 6 pieces of the kernel, which really
funneled down to 5 critical subsystems; the new VFS code was a
workaround for the consent decree for one of those, IMO.)

> > > We are very interested in the most optimal, shortest path to I/O
> > > on a large number of disks.
> >
> > o  Write in the FS block size, not the disk block size, to
> >    avoid causing a read before the write can be done
>
> No file system.  See above.  What is the block size used then?

This is dependent on the device and the device driver.  For disk
devices, the block size is 512 bytes.  This only really applies if you
are using the block device, not the character device; I recommend you
use the block device.

> All these (stripping off the file system pointers, as they do not
> apply) are good and valid, except:
>
> 1. We have to guarantee transactions.
>    This means that system failure, at ANY time, cannot ``undo'' what
>    is said to be done.  IOW, a WRITE call that returns positively is
>    known to have been recorded on error/failure-resistant medium.  We
>    will be using DPT RAID controllers for a mix of RAID-1 and RAID-5,
>    as the case justifies.

Are you using a journal, a log, or some other method to handle implied
state across domains?

For example, say I have an index and a bunch of records pointed to by
that index.  In order to do the transaction, I need a two-stage commit:
I allocate a new record, I write it, and I then rewrite the index.  In
practical terms, this is:

	i)	alloc new record
	ii)	write new record
	iii)	commit new record
	iv)	write new index
	v)	deallocate old record

...in other words, a standard two-stage commit process across two
files.

If you are using a log, in case of failure you can "undo" any partially
complete transactions (in the commit order above, you can recover by
back-out only... allocated records without indices are deallocated on
the next startup).  In the general case, you can roll your transaction
forward if you add:

	.)	start transaction with "intent" record
	i)	alloc new record
	ii)	write new record
	.)	write "record data valid" -- this replaces "commit"

	XXX failure after this point can be rolled forward using the
	    "intent" record

	iv)	write new index
	v)	deallocate old record
	.)	mark transaction complete

Note: THIS DOES NOT REQUIRE A COMMIT TO DISK EXCEPT FOR THE LOG.  You
must guarantee the order of the actual write operations, but not that
each write operation has actually taken place before you start the next
operation.  So basically, what you have to commit vs. what you merely
have to order can save you a hell of a lot of waiting.

> 2. Our basic recorded unit is less than 512 bytes long.  We compromise
>    and round it (way) up to 512, since nobody makes fast disk drives
>    with sectors smaller than that anymore.  Yes, SCSI being what it
>    is, even 512 is way small.  We know...

Then this must be for the write transaction unit.  It is too bad you
are not using a RAID 4 stripe set and writing exactly a full stripe at
a time using spindle sync.

> 3. In case of system failure (most common reason today == O/S crash)
>    we must be back in operation within less than 10 seconds.  We do
>    that by sharing the disks with another system, which is already up.

???  Do you mean sharing the physical drives, or a network share?  I'm
guessing physical sharing.

There are some not very general things you can do to make an OS boot
nearly instantly (I keep wanting them for private APM modes and for
system install).  For one, you could keep a log of system state, and
restore system state from the log, rather than booting normally.  This
requires the cooperation of the device drivers and certain parts of the
boot process.

> 4. We need to process a very large number of interrupts.  In fact, so
>    many that one FreeBSD CPU cannot keep up.  So, we are back to
>    shared disks.

I suspect you are using PCI controllers.  PCI does not support "fast
interrupts".  Contact Bruce Evans for details on how you can fix this.

> 5. Because disks are shared, the write state must be very
>    deterministic at all times.  As the O/S has caches, RAID
>    controllers have caches, and disks have caches, we have to have
>    some sense of who has what in which cache when.  Considering the
>    O/S to be the most lossy element in the system, we have to keep
>    the amount of WRITE caches to a minimum.

Unless they are non-volatile, anyway.
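For concreteness, here is a minimal user-space sketch, in C, of the
intent-record sequence above.  It is an illustration only, not code
from this thread: the file names, the 512-byte log entry format, and
the log_commit()/write_block() helpers are invented for the example.
The point it shows is the one made above -- only the log appends are
forced to stable storage (fsync), while the record and index writes
are merely issued in order.

/*
 * Sketch of the "intent record" commit sequence.  Hypothetical layout;
 * only the log is synchronously committed, the data and index writes
 * are just issued in order.
 */
#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SECTOR	512

static int log_fd, data_fd, index_fd;

/* Append one 512-byte entry to the log and force it to the media. */
static void
log_commit(const char *tag, unsigned txn)
{
	char entry[SECTOR];

	memset(entry, 0, sizeof(entry));
	snprintf(entry, sizeof(entry), "txn %u: %s", txn, tag);
	if (write(log_fd, entry, sizeof(entry)) != (ssize_t)sizeof(entry) ||
	    fsync(log_fd) == -1) {
		perror("log");
		exit(1);
	}
}

/* Issue (but do not commit) one 512-byte write at a given offset. */
static void
write_block(int fd, off_t where, const void *buf)
{
	if (pwrite(fd, buf, SECTOR, where) != SECTOR) {
		perror("pwrite");
		exit(1);
	}
	/* No fsync here: we rely on issue order, not per-write commit. */
}

int
main(void)
{
	char record[SECTOR], index[SECTOR];
	unsigned txn = 1;
	off_t rec_slot = 4 * SECTOR;	/* hypothetical allocation */
	off_t idx_slot = 0;

	log_fd   = open("txn.log", O_RDWR | O_CREAT | O_APPEND, 0600);
	data_fd  = open("records", O_RDWR | O_CREAT, 0600);
	index_fd = open("index",   O_RDWR | O_CREAT, 0600);
	if (log_fd == -1 || data_fd == -1 || index_fd == -1) {
		perror("open");
		return (1);
	}

	memset(record, 'R', sizeof(record));
	memset(index,  'I', sizeof(index));

	log_commit("intent", txn);		/* .)  start transaction      */
	write_block(data_fd, rec_slot, record);	/* i/ii) alloc + write record */
	log_commit("record data valid", txn);	/* .)  replaces "commit"      */
	write_block(index_fd, idx_slot, index);	/* iv) write new index        */
	/* v) deallocating the old record would go here */
	log_commit("complete", txn);		/* .)  mark transaction done  */

	return (0);
}

On a shared-disk configuration like the one in item 5, the ordering
assumption also has to hold through the controller and drive caches,
which is exactly why the amount of volatile WRITE caching has to be
kept to a minimum.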
> > (zero locality of reference: a hard thing to find in the real world)
> > prevent the read-ahead from being invoked.
>
> Ah!  There is read-ahead on raw devices?  How do we shut it down?

There is read-ahead for any device which is sequentially accessed.  If
you do not access it sequentially, you will not trigger read-ahead.
This is a non-problem (I think).

[ ... block size ... ]

> How does all this relate to raw/character devices?

It doesn't (see up top; I didn't think you really meant character
devices when you said "raw").  But neither does the original question,
then, since block size is largely irrelevant above device block size
granularity.  It will depend on the disk driver, the controller cache
size, and whether or not the disk supports track write caching itself.

> > > What we see is a flat WRITE response until 2K; then it starts a
> > > linear decline until it reaches 8K block size.  At this point it
> > > converges with READ performance.  The initial WRITE performance
> > > for small blocks is quite poor compared to READ.  We attribute it
> > > to the need to do read-modify-write when blocks are smaller than
> > > a certain ``natural block size'' (page?).
> >
> > Yes.  But the FS block size is 8k, not the page size (4k).
>
> We were not using a filesystem.  That's the point.

Then it's undefined, and it's relative to the controller/disk
combination only.

> O_WRITESYNC!  This is an open(2) option that says that all writes are
> synchronous (do not return until actually done).  Right?  And it
> applies to block devices, as well as filesystem files.  Right?

Yes.  It internally does the same thing you are doing, without the
additional transition out and then back in across the protection
domain, with the accompanying possibility of a context switch.  (A
short sketch of this appears at the end of this message.)

> The ``only'' difference is an additional 200 system calls per second?
> How many of these can a Pentium Pro, 512K cache, 128MB RAM, etc., do
> in one second?  We are always in the 1,000+ in our budget.  A 20%
> increase is a lot to us.

It depends on where you are bound up.  If all writes are synchronous,
you are bound up in disk I/O, not system call overhead.  If writes are
being guaranteed, and you don't force synchronicity to imply
idempotence across disk operations that aren't themselves atomic
(i.e., index/data relationships), then you may see the system call
overhead.

I know I have been on projects where this was important enough that we
defined our own system calls to combine write-then-read operations on
networks, I/O and stat operations, and pattern matching in the kernel
so that irrelevant data is not pushed back over the getdents interface,
etc.

> > Most likely, you do not really need this, or you are poorly
> > implementing the two-stage commit process typical of most modern
> > database design.
>
> Assumptions, assumptions... :-)  There is no database, and there is
> no two-phase commit here.  Wish I could share more details in this
> forum, but I am already stretching it :-(

I'd have to say your synchronicity requirements are probably specious,
then.  What you really have are transaction ordering requirements,
and, as noted above, you don't have to have synchronicity to implement
them.

> > > The READ performance is even more peculiar.  It starts higher than
> > > WRITE, declines rapidly until block size reaches 2K.  It peaks at
> > > 4K blocks and starts a linear decline from that point on (as block
> > > size increases).
> >
> > This is because of precache effects.  Your "random" reads are not
> > "random" enough to get rid of cache effects, it seems.
> > If they were, the 4k numbers would be worse, and the peak would be
> > the FS block size.
>
> On a block device?  Which filesystem?

Well, disk block size, then.

> The same tests described here were run on a well known commercial OS.
> It exhibits totally flat response from 512 bytes to 4Kb blocks.  What
> happened at 8K blocks and larger?  The process will totally hang if
> you did read + (O_SYNC) write on the same file at the same time.

Cute.  Sounds like they have a single queue for locking operations on
vnodes.

If I had to guess, your commercial OS was Solaris 2.x, x>=3.  I really
disagree with the way Solaris implements its SMP locking; it doesn't
scale well, it's not as concurrent as they'd like you to believe, and
it's hard for third parties to use.  I wish you could try it on a
Unisys 6000/50 SVR4.0.2 ES/MP system; I believe they did vnode locking
correctly.

> > Jorg, Julian, and the specific SCSI driver authors are probably
> > your best resource below the bdevsw[] layer.
>
> I appreciate that.  I have not seen anything in the SCSI layer that
> really ``cares'' about the type of I/O done.  It all appears the same.

In general, it's not supposed to care.


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
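As a footnote to the O_WRITESYNC exchange above, here is a minimal
sketch of letting the kernel do the synchronization.  It assumes the
POSIX O_SYNC spelling (older BSDs spell it O_FSYNC) and a hypothetical
device node; it is an illustration of the idea, not code from this
thread.  Each write(2) then returns only after the data has been
handed to stable storage, so the separate fsync() call -- and the
extra kernel transition that goes with it -- disappears.

/*
 * Sketch only: open a (hypothetical) block device for synchronous
 * writes, so each write(2) does not return until the data has been
 * committed -- the in-kernel equivalent of write() followed by fsync().
 */
#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#ifndef O_SYNC			/* older BSDs spell it O_FSYNC */
#define O_SYNC	O_FSYNC
#endif

#define SECTOR	512

int
main(void)
{
	char buf[SECTOR];
	int fd;

	/* /dev/sd1s1e is a placeholder; substitute the real partition. */
	fd = open("/dev/sd1s1e", O_WRONLY | O_SYNC);
	if (fd == -1) {
		perror("open");
		return (1);
	}

	memset(buf, 0, sizeof(buf));

	/* Seek and write one transaction unit; no separate fsync() call. */
	if (lseek(fd, (off_t)100 * SECTOR, SEEK_SET) == -1 ||
	    write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
		perror("write");
		close(fd);
		return (1);
	}

	close(fd);
	return (0);
}

Whether this or explicit ordering (as in the log sketch earlier in the
message) is the better trade-off comes back to the point above: if the
real requirement is ordering rather than synchronicity, synchronous
writes buy their guarantee by leaving you bound up in disk I/O.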