Date: Fri, 9 Oct 1998 22:02:44 -0700 (PDT)
From: dan@math.berkeley.edu (Dan Strick)
To: tlambert@primenet.com
Cc: dan@math.berkeley.edu, freebsd-smp@FreeBSD.ORG
Subject: Re: hw platform Q - what's a good smp choice these days?
Message-ID: <199810100502.WAA17750@math.berkeley.edu>
> We can argue about whether the FS code should be reading mode page 2
> and acting with the physical geometry in mind in order to minimize
> actual seeks, and that FreeBSD's imaginary 4M cylinder is broken.

I suspect we would both agree that teaching the FS code and the driver
code to optimize for the actual disk geometry would be so painful and
perhaps computationally expensive as to be not worth the effort. It is
probably adequate to model a disk as a sequence of blocks with, on
average, a much larger latency between nonconsecutive blocks than
between consecutive blocks. It may also be true that on average the
latency increases with the difference in block numbers, but the actual
function is so jagged that this approximation is of uncertain value.

I tend to divide disk activity into several categories which must be
optimized separately. The first category is randomly located I/O
requests separated by large disk latencies. In this case, I/O
reordering is useful and the per-command SCSI latencies are so small
that they fit unnoticed within the large disk latencies. There is no
practical difference between reads and writes. Simultaneously executed
SCSI commands are not very useful. The second category is highly
localized disk I/O for mostly noncontiguous chunks. The third category
is contiguous disk I/O. Both of these categories have read and write
cases. Since modern drives do speculative read-ahead, the read cases
behave similarly, but the write cases are different. (Modern drives may
also be capable of write-behind (i.e. cached writes), but they had
better not do it with my valued data.)

> The clustering code and locality of reference will generally ensure
> that data locality for a given process is relatively high; this,
> combined with the fact that most modern SCSI drives inverse-order
> the blocks and start reading immediately after the seek (effectively
> read-caching the high locality data) will definitely shorten the
> increment.
>
> Also, let me once again emphasize that the improvement is due to
> interleaved I/O, which gets rid of the aggregate latency, replacing
> it with a single latency.

You have lost me here. I must not understand the "aggregate latency"
to which you refer. If we execute our I/O commands serially, we can
divide the time into SCSI command latency (command processing
overhead), disk latency (waiting for the heads to reach the data),
disk data transfer time (between the heads and the drive data buffer),
and DMA (between the drive and system main memory through the SCSI and
PCI busses). A smart drive might overlap some of the disk data transfer
with DMA, and in the case of highly localized disk reads it might
effectively overlap disk latency with disk data transfer by doing
speculative read-ahead. I don't understand what you mean by
"interleaved I/O" or how this relates to the I/O sub-activities I have
listed above.
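To make the "I/O reordering" of the first category a little more
concrete, here is a toy sketch in the spirit of the kernel's
disksort(): pending requests are kept in ascending block order so that
the heads sweep across the disk instead of bouncing around. The
structure and function names and the block numbers are invented for
illustration; this is not the actual FreeBSD code.

    /*
     * Toy model: a pending disk request identified by its starting
     * block number, kept on a singly linked queue.
     */
    #include <stdio.h>
    #include <stdlib.h>

    struct dreq {
        long blkno;             /* starting block number */
        struct dreq *next;
    };

    /*
     * Insert a request so that the queue stays in ascending block
     * order.  A real one-way elevator would also take the current
     * head position into account; this only shows the sort itself.
     */
    static void
    enqueue_sorted(struct dreq **head, struct dreq *rq)
    {
        struct dreq **pp = head;

        while (*pp != NULL && (*pp)->blkno < rq->blkno)
            pp = &(*pp)->next;
        rq->next = *pp;
        *pp = rq;
    }

    int
    main(void)
    {
        long blocks[] = { 900, 10, 400, 15, 870 };
        struct dreq *q = NULL, *rq;
        size_t i;

        for (i = 0; i < sizeof(blocks) / sizeof(blocks[0]); i++) {
            rq = malloc(sizeof(*rq));
            rq->blkno = blocks[i];
            enqueue_sorted(&q, rq);
        }
        for (rq = q; rq != NULL; rq = rq->next)
            printf("%ld\n", rq->blkno);     /* 10 15 400 870 900 */
        return (0);
    }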
> > The one big advantage of tagged drivers is the possibility that disk
> > activity could overlap DMA, but this of course depends on the smarts
> > of the particular disk drive and the SCSI host adapter and it only
> > matters if the disk latencies are so small that disk revs would be
> > lost otherwise. (It is hard to draw a picture of this in ascii.)
> > Even in this case, a smart driver that does 2 simultaneous SCSI
> > commands might do as well as one that does 64.
>
> If there is an intervening seek, yes. But in general, the number
> of sectors per cylinder has increased, not decreased, over time.

Actually, I was visualizing disk writes to consecutive or nearly
consecutive sectors with no intervening seeks or head switches at all.
I was also visualizing the disk sectors written in order of increasing
sector number, so that the disk could be kept continuously busy
provided that DMA is always completed before the next sector to be
written comes underneath the heads. In this case, it could be very
useful to begin DMA for the next write command before the current
write command is complete. Two simultaneous SCSI I/O commands might be
sufficient.

Reversing the sector order in the track changes everything. Without
detailed knowledge of the actual disk geometry, the only obvious tactic
is to issue large writes (by merging I/O requests). It doesn't much
matter if you do this with a single SCSI command or a bunch of
simultaneous SCSI commands. If you do it with a single command, you
have reason to hope that even a dumb drive will do all the sectors in a
single track in a single rev, and you are certain to eliminate some of
the per-command overhead, though you will also force all of the merged
I/O requests to wait until the last is done. If you do it with multiple
SCSI commands, you might benefit from early completion of some of the
commands. On the other hand, the drive might choose to do the I/O
inefficiently.
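Here is a similarly toy sketch of the "merging I/O requests" tactic:
writes that turn out to be contiguous on disk (already sorted by block
number) are coalesced into one larger transfer that could then be
issued as a single large SCSI write. The request sizes and block
numbers are again invented; the real clustering code in the kernel is
considerably more involved.

    #include <stdio.h>

    /* Toy model of a pending write: starting block and length. */
    struct wreq {
        long blkno;
        long nblks;
    };

    /*
     * Given requests already sorted by block number, print the merged
     * (contiguous) runs, each of which could be one large write.
     */
    static void
    merge_writes(const struct wreq *rq, int n)
    {
        long start, len;
        int i;

        if (n == 0)
            return;
        start = rq[0].blkno;
        len = rq[0].nblks;
        for (i = 1; i < n; i++) {
            if (rq[i].blkno == start + len) {
                len += rq[i].nblks;     /* contiguous: extend the run */
            } else {
                printf("write blk %ld, %ld blks\n", start, len);
                start = rq[i].blkno;    /* gap: start a new command */
                len = rq[i].nblks;
            }
        }
        printf("write blk %ld, %ld blks\n", start, len);
    }

    int
    main(void)
    {
        struct wreq q[] = {
            { 100, 16 }, { 116, 16 }, { 132, 16 },  /* one 48-block run */
            { 400, 16 }, { 416, 16 },               /* one 32-block run */
        };

        merge_writes(q, sizeof(q) / sizeof(q[0]));
        return (0);
    }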
> We can also state that a process waiting for I/O will be involuntarily
> context switched, in favor of another process, and that the pool
> retention time that we are really interested in, in terms of
> determining overall data transfer rates, is based on the transfer to
> user space, not merely the transfer from the controller into system
> memory. As before, this greatly amplifies the effects of serialized
> I/O, hence my initial steep slope for my "stair-step" diagram.

I think you are saying that the process of transferring data between
wherever the device controller accesses it and the running program's
virtual memory is something else that can be overlapped with the other
I/O activities if only we are doing enough different things at once. I
would guess that this transfer process takes place at least at main
memory speeds, something on the order of 10 times the raw disk data
transfer rate. I suspect the memory transfer latency can almost be
ignored. (I also don't understand the significance of the context
switch. Perhaps I don't understand something important about the PC
I/O system.)

> > This also applies to the special case of doing contiguous disk reads
> > from a drive that does substantial read-ahead. There is no lost-rev
> > issue, but overlapping DMA with something else is possible. In this
> > case also, 2 simultaneous SCSI commands are probably as good as 64;
> > performance improvements over the smart untagged driver cannot
> > possibly exceed a factor of two.
>
> I can't really parse this, but if (1) the commands are overlapped,
> and (2) operating against read-ahead cache on the disk itself,
> then I can't see how more commands don't equal more performance,

The bottleneck will probably be the raw disk (perhaps 10 MB/sec). The
SCSI bus will probably be much faster, and DMA will be much faster
still. Even executing only one SCSI command at a time, all this
additional activity and miscellaneous SCSI command overhead, even if
serialized, will mostly overlap the raw disk transfer time.

Example (1 8kb transfer):

    SCSI bus @ 20 MB/sec:       400 us
    raw disk @ 10 MB/sec:                   800 us
    PCI DMA @ 120 MB/sec:        70 us
    SCSI command overhead:      500 us
                               -------     -------
    total:                      970 us      800 us

Note: "SCSI command overhead" includes time spent in the SCSI driver.
Even so, it may be overstated. It does not include time which would
overlap the SCSI bus transfer or the DMA (for the same SCSI command).
It is not clear how much, if any, of the DMA overlaps the SCSI bus
transfer. This overlap is not affected by using tagged SCSI commands.

> in terms of linearly scaling. I don't think that it's likely,
> unless the disk itself contains as many track buffers as some
> high fraction of the number of tagged commands it supports (in
> the ideal, 1:1), to achieve optimal benefit, but it's certainly
> unlikely to be as pessimal as taking a seek hit plus a rotational
> latency, which is what your "2" implies...

I don't think I mentioned seeks or rotational latencies in this case.
My model assumes the disk drive is sucking bits off the disk as fast as
they come and that the disk drive is passing those bits down the SCSI
bus to the host adapter as fast as it asks for them. The raw disk
activity is basically read-ahead. It cannot overlap itself. It happens
continuously, even if the SCSI read commands are executed one at a
time. The only effect of queuing multiple simultaneous SCSI commands is
to possibly overlap the "SCSI command overhead" of one command with the
DMA and SCSI bus data transfer of other commands. If complete overlap
were achieved, the larger would remain. (Hence the limiting factor of
"2".) In this case, the raw disk transfer rate limitation would also
remain, so for these numbers the best improvement would be a factor of
970/800 (about 1.2).
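For the curious, the arithmetic behind the example and the 970/800
figure is just the following. The transfer rates and the 500 us command
overhead are rough guesses rather than measurements, and the 70 us DMA
figure above is rounded, so this prints roughly 1.21 rather than
exactly 1.2.

    #include <stdio.h>

    int
    main(void)
    {
        /* One 8 kb transfer, using the rough rates from the example. */
        double kb = 8.0;
        double scsi_bus_us = kb / 20e3 * 1e6;   /* 20 MB/sec  -> 400 us */
        double raw_disk_us = kb / 10e3 * 1e6;   /* 10 MB/sec  -> 800 us */
        double pci_dma_us  = kb / 120e3 * 1e6;  /* 120 MB/sec -> ~67 us */
        double overhead_us = 500.0;             /* guessed cmd overhead */

        /* Everything except the raw disk transfer, done serially. */
        double other_us = scsi_bus_us + pci_dma_us + overhead_us;

        printf("SCSI bus + DMA + overhead: %4.0f us per 8 kb\n", other_us);
        printf("raw disk transfer:         %4.0f us per 8 kb\n", raw_disk_us);

        /*
         * Serial execution is paced by the larger column; with perfect
         * overlap between queued commands only the raw disk bound
         * remains, so the ratio bounds the possible improvement.
         */
        printf("best possible improvement: %.2f\n",
            (other_us > raw_disk_us ? other_us : raw_disk_us) / raw_disk_us);
        return (0);
    }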
Dan Strick
dan@math.berkeley.edu

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-smp" in the body of the message