Date: Thu, 12 Aug 2010 15:02:17 -0600 From: "Justin T. Gibbs" <gibbs@scsiguy.com> To: current@FreeBSD.org Subject: Call for Review: Suport for BIO_ORDERED bios Message-ID: <4C646159.3080604@scsiguy.com>
next in thread | raw e-mail | index | archive | help
The following patches introduce the BIO_ORDERED flag for bios. A bio tagged with the BIO_ORDERED flag has barrier semantics: all bios queued before the BIO_ORDERED bio are executed before the BIO_ORDERED bio and any bios queued after the BIO_ORDERED bio are executed after the BIO_ORDERED bio. FreeBSD has offered these semantics before. I added support for ordered bufs over a decade ago, but with no consumers of that interface in the tree, it was removed. Today the landscape has changed: ZFS requires ordered semantics to be able to safely commit transactions, Jeff Roberson has talked of modifying UFS to take advantage of this feature to speed up soft-updates, Linux ext3/4 filesystems will take advantage of this feature, and VMs using storage exported via Xen's blkback interface can also querry for this feature. My changes are sufficient to allow ZFS and the Xen blkback driver (more on that driver in another email) to perform ordered I/O. Additional, mostly trivial, changes will be required to pass ordering information through the buf interface if/when other buf clients grow to use this capability. Are there any comments/concerns about these changes before I commit them? Thanks, Justin Change 216125 by justing@spectrabsd.import on 2010/07/29 16:24:07 Add the BIO_ORDERED flag for struct bio and update bio clients to use it. sys/sys/bio.h: Add BIO_ORDERED as bit 4 of the bio_flags field in struct bio. sys/kern/subr_disk.c: In bioq_disksort(), bypass the normal sort for bios with the BIO_ORDERED attribute and instead insert them into the queue with bioq_insert_tail(). bioq_insert_tail() not only gives the desired command order during insertion, but also provides barrier semantics so that commands disksorted in the future cannot pass the just enqueued transaction. sys/cam/ata/ata_da.c: sys/cam/scsi/scsi_da.c Use an ordered command for SCSI/ATA-NCQ commands issued in response to bios with the BIO_ORDERED flag set. sys/cam/scsi/scsi_da.c Use an ordered tag when issuing a synchronize cache command. Wrap some lines to 80 columns. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c sys/geom/geom_io.c Mark bios with the BIO_FLUSH command as BIO_ORDERED. Affected files ... ... //depot/SpectraBSD/head/sys/cam/ata/ata_da.c#5 edit ... //depot/SpectraBSD/head/sys/cam/scsi/scsi_da.c#4 edit ... //depot/SpectraBSD/head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c#3 edit ... //depot/SpectraBSD/head/sys/geom/geom_io.c#4 edit ... //depot/SpectraBSD/head/sys/kern/subr_disk.c#3 edit ... //depot/SpectraBSD/head/sys/sys/bio.h#3 edit Change 216126 by justing@spectrabsd.import on 2010/07/29 16:35:46 Correct bioq_disksort so that bioq_insert_tail() offers barrier semantic. The barrier semantics of bioq_insert_tail() were broken in two ways: o In bioq_disksort(), an added bio could be inserted at the head of the queue, even when a barrier was present, if the sort key for the new entry was less than that of the last queued barrier bio. o The last_offset used to generate the sort key for newly queued bios did not stay at the position of the barrier until either the barrier was de-queued, or a new barrier (which updates last_offset) was queued. When a barrier is in effect, we know that the disk will pass through the barrier position just before the "blocked bios" are released, so using the barrier's offset for last_offset is the optimal choice. sys/geom/sched/subr_disk.c: sys/kern/subr_disk.c: o Update last_offset in bioq_insert_tail(). o Only update last_offset in bioq_remove() if the removed bio is at the head of the queue (typically due to a call via bioq_takefirst()) and no barrier is active. o In bioq_disksort(), if we have a barrier (insert_point is non-NULL), set prev to the barrier and cur to it's next element. Now that last_offset is kept at the barrier position, this change isn't strictly necessary, but since we have to take a decision branch anyway, it does avoid one, no-op, loop iteration in the while loop that immediately follows. Affected files ... ... //depot/SpectraBSD/head/sys/geom/sched/subr_disk.c#3 edit ... //depot/SpectraBSD/head/sys/kern/subr_disk.c#4 edit --- //depot/vendor/FreeBSD/head/sys/cam/ata/ata_da.c 2010-07-25 15:43:52.000000000 -0600 +++ /home/justing/perforce/FreeBSD/head/sys/cam/ata/ata_da.c 2010-07-25 15:43:52.000000000 -0600 @@ -874,7 +874,8 @@ } bioq_remove(&softc->bio_queue, bp); - if ((softc->flags & ADA_FLAG_NEED_OTAG) != 0) { + if ((bp->bio_flags & BIO_ORDERED) != 0 + || (softc->flags & ADA_FLAG_NEED_OTAG) != 0) { softc->flags &= ~ADA_FLAG_NEED_OTAG; softc->ordered_tag_count++; tag_code = 0; --- //depot/vendor/FreeBSD/head/sys/cam/scsi/scsi_da.c 2010-07-25 15:43:52.000000000 -0600 +++ /home/justing/perforce/FreeBSD/head/sys/cam/scsi/scsi_da.c 2010-07-25 15:43:52.000000000 -0600 @@ -1354,7 +1354,8 @@ bioq_remove(&softc->bio_queue, bp); - if ((softc->flags & DA_FLAG_NEED_OTAG) != 0) { + if ((bp->bio_flags & BIO_ORDERED) != 0 + || (softc->flags & DA_FLAG_NEED_OTAG) != 0) { softc->flags &= ~DA_FLAG_NEED_OTAG; softc->ordered_tag_count++; tag_code = MSG_ORDERED_Q_TAG; @@ -1368,7 +1369,8 @@ /*retries*/da_retry_count, /*cbfcnp*/dadone, /*tag_action*/tag_code, - /*read_op*/bp->bio_cmd == BIO_READ, + /*read_op*/bp->bio_cmd + == BIO_READ, /*byte2*/0, softc->minimum_cmd_size, /*lba*/bp->bio_pblkno, @@ -1377,17 +1379,24 @@ /*data_ptr*/ bp->bio_data, /*dxfer_len*/ bp->bio_bcount, /*sense_len*/SSD_FULL_SIZE, - /*timeout*/da_default_timeout*1000); + da_default_timeout * 1000); break; case BIO_FLUSH: + /* + * BIO_FLUSH doesn't currently communicate + * range data, so we synchronize the cache + * over the whole disk. We also force + * ordered tag semantics the flush applies + * to all previously queued I/O. + */ scsi_synchronize_cache(&start_ccb->csio, /*retries*/1, /*cbfcnp*/dadone, - MSG_SIMPLE_Q_TAG, - /*begin_lba*/0,/* Cover the whole disk */ + MSG_ORDERED_Q_TAG, + /*begin_lba*/0, /*lb_count*/0, SSD_FULL_SIZE, - /*timeout*/da_default_timeout*1000); + da_default_timeout*1000); break; } start_ccb->ccb_h.ccb_state = DA_CCB_BUFFER_IO; --- //depot/vendor/FreeBSD/head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c 2010-07-12 23:49:04.000000000 -0600 +++ /home/justing/perforce/FreeBSD/head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c 2010-07-12 23:49:04.000000000 -0600 @@ -598,6 +598,7 @@ break; case ZIO_TYPE_IOCTL: bp->bio_cmd = BIO_FLUSH; + bp->bio_flags |= BIO_ORDERED; bp->bio_data = NULL; bp->bio_offset = cp->provider->mediasize; bp->bio_length = 0; --- //depot/vendor/FreeBSD/head/sys/geom/geom_io.c 2010-06-10 17:49:36.000000000 -0600 +++ /home/justing/perforce/FreeBSD/head/sys/geom/geom_io.c 2010-06-10 17:49:36.000000000 -0600 @@ -265,6 +265,7 @@ g_trace(G_T_BIO, "bio_flush(%s)", cp->provider->name); bp = g_alloc_bio(); bp->bio_cmd = BIO_FLUSH; + bp->bio_flags |= BIO_ORDERED; bp->bio_done = NULL; bp->bio_attribute = NULL; bp->bio_offset = cp->provider->mediasize; --- //depot/vendor/FreeBSD/head/sys/geom/sched/subr_disk.c 2010-04-12 16:37:45.000000000 -0600 +++ /home/justing/perforce/FreeBSD/head/sys/geom/sched/subr_disk.c 2010-04-12 16:37:45.000000000 -0600 @@ -86,7 +86,7 @@ * bioq_remove() remove a generic element from the queue, act as * bioq_takefirst() if invoked on the head of the queue. * - * The semantic of these methods is the same of the operations + * The semantic of these methods is the same as the operations * on the underlying TAILQ, but with additional guarantees on * subsequent bioq_disksort() calls. E.g. bioq_insert_tail() * can be useful for making sure that all previous ops are flushed @@ -115,10 +115,10 @@ gs_bioq_remove(struct bio_queue_head *head, struct bio *bp) { - if (bp == TAILQ_FIRST(&head->queue)) - head->last_offset = bp->bio_offset + bp->bio_length; - - if (bp == head->insert_point) + if (head->insert_point == NULL) { + if (bp == TAILQ_FIRST(&head->queue)) + head->last_offset = bp->bio_offset + bp->bio_length; + } else if (bp == head->insert_point) head->insert_point = NULL; TAILQ_REMOVE(&head->queue, bp, bio_queue); @@ -137,7 +137,8 @@ gs_bioq_insert_head(struct bio_queue_head *head, struct bio *bp) { - head->last_offset = bp->bio_offset; + if (head->insert_point == NULL) + head->last_offset = bp->bio_offset; TAILQ_INSERT_HEAD(&head->queue, bp, bio_queue); } @@ -147,6 +148,7 @@ TAILQ_INSERT_TAIL(&head->queue, bp, bio_queue); head->insert_point = bp; + head->last_offset = bp->bio_offset; } struct bio * @@ -189,13 +191,28 @@ void gs_bioq_disksort(struct bio_queue_head *head, struct bio *bp) { - struct bio *cur, *prev = NULL; - uoff_t key = gs_bioq_bio_key(head, bp); + struct bio *cur, *prev; + uoff_t key; + + if ((bp->bio_flags & BIO_ORDERED) != 0) { + /* + * Ordered transactions can only be dispatched + * after any currently queued transactions. They + * also have barrier semantics - no transactions + * queued in the future can pass them. + */ + gs_bioq_insert_tail(head, bp); + return; + } + prev = NULL; + key = gs_bioq_bio_key(head, bp); cur = TAILQ_FIRST(&head->queue); - if (head->insert_point) - cur = head->insert_point; + if (head->insert_point) { + prev = head->insert_point; + cur = TAILQ_NEXT(head->insert_point, bio_queue); + } while (cur != NULL && key >= gs_bioq_bio_key(head, cur)) { prev = cur; --- //depot/vendor/FreeBSD/head/sys/kern/subr_disk.c 2010-07-18 20:57:53.000000000 -0600 +++ /home/justing/perforce/FreeBSD/head/sys/kern/subr_disk.c 2010-07-18 20:57:53.000000000 -0600 @@ -127,7 +127,7 @@ * bioq_remove() remove a generic element from the queue, act as * bioq_takefirst() if invoked on the head of the queue. * - * The semantic of these methods is the same of the operations + * The semantic of these methods is the same as the operations * on the underlying TAILQ, but with additional guarantees on * subsequent bioq_disksort() calls. E.g. bioq_insert_tail() * can be useful for making sure that all previous ops are flushed @@ -156,10 +156,10 @@ bioq_remove(struct bio_queue_head *head, struct bio *bp) { - if (bp == TAILQ_FIRST(&head->queue)) - head->last_offset = bp->bio_offset + bp->bio_length; - - if (bp == head->insert_point) + if (head->insert_point == NULL) { + if (bp == TAILQ_FIRST(&head->queue)) + head->last_offset = bp->bio_offset + bp->bio_length; + } else if (bp == head->insert_point) head->insert_point = NULL; TAILQ_REMOVE(&head->queue, bp, bio_queue); @@ -178,7 +178,8 @@ bioq_insert_head(struct bio_queue_head *head, struct bio *bp) { - head->last_offset = bp->bio_offset; + if (head->insert_point == NULL) + head->last_offset = bp->bio_offset; TAILQ_INSERT_HEAD(&head->queue, bp, bio_queue); } @@ -188,6 +189,7 @@ TAILQ_INSERT_TAIL(&head->queue, bp, bio_queue); head->insert_point = bp; + head->last_offset = bp->bio_offset; } struct bio * @@ -230,13 +232,28 @@ void bioq_disksort(struct bio_queue_head *head, struct bio *bp) { - struct bio *cur, *prev = NULL; - uoff_t key = bioq_bio_key(head, bp); + struct bio *cur, *prev; + uoff_t key; + + if ((bp->bio_flags & BIO_ORDERED) != 0) { + /* + * Ordered transactions can only be dispatched + * after any currently queued transactions. They + * also have barrier semantics - no transactions + * queued in the future can pass them. + */ + bioq_insert_tail(head, bp); + return; + } + prev = NULL; + key = bioq_bio_key(head, bp); cur = TAILQ_FIRST(&head->queue); - if (head->insert_point) - cur = head->insert_point; + if (head->insert_point) { + prev = head->insert_point; + cur = TAILQ_NEXT(head->insert_point, bio_queue); + } while (cur != NULL && key >= bioq_bio_key(head, cur)) { prev = cur; --- //depot/vendor/FreeBSD/head/sys/sys/bio.h 2009-12-11 10:35:58.000000000 -0700 +++ /home/justing/perforce/FreeBSD/head/sys/sys/bio.h 2009-12-11 10:35:58.000000000 -0700 @@ -54,6 +54,7 @@ #define BIO_ERROR 0x01 #define BIO_DONE 0x02 #define BIO_ONQUEUE 0x04 +#define BIO_ORDERED 0x08 #ifdef _KERNEL struct disk;
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?4C646159.3080604>