Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 12 Jul 2012 16:34:02 -0400
From:      Steve McCoy <smccoy@greatbaysoftware.com>
To:        jhb@freebsd.org
Cc:        Charles Owens <cowens@greatbaysoftware.com>, freebsd-stable@freebsd.org
Subject:   Re: mfi(4) IO performance regression, post 8.1
Message-ID:  <4FFF34BA.9030002@greatbaysoftware.com>
In-Reply-To: <4FFF301E.30603@greatbaysoftware.com>
References:  <4FDABA0B.5030702@greatbaysoftware.com> <201206150804.46341.jhb@freebsd.org> <4FE3DA14.9090506@greatbaysoftware.com> <4FFF301E.30603@greatbaysoftware.com>

next in thread | previous in thread | raw e-mail | index | archive | help
On 7/12/12 4:14 PM, Charles Owens wrote:
> On Thursday, June 21, 2012 10:36:04 pm Charles Owens wrote:
>>
>> On 6/15/12 8:04 AM, John Baldwin wrote:
>> > On Friday, June 15, 2012 12:28:59 am Charles Owens wrote:
>> >> Hello FreeBSD folk,
>> >>
>> >> We're seeing what appears to be a storage performance regression as we
>> >> try to move from 8.1 (i386) to 8.3.   We looked at 8.2 also and it
>> >> appears that the regression happened between 8.1 and 8.2.
>> >>
>> >> Our system is an Intel S5520UR Server with 12 GB RAM, dual 4-core
>> CPUs.
>> >> Storage is a LSI MegaSAS 1078 controller (mfi) in a RAID-10
>> >> configuration, using UFS + geom_journal for filesystem.
>> >>
>> >> Postgresql performance, as seen via pgbench, dropped by approx 20%.
>> >> This testing was done with our usual PAE-enabled kernels.  We then
>> went
>> >> back to GENERIC kernels and did comparisons using "bonnie", results
>> >> below.  Following that is a kernel boot log.
>> >>
>> >> Notably, we're seeing this regression only with our RAID mfi(4) based
>> >> systems.  Notably, from looking at FreeBSD source changelogs it
>> appears
>> >> that the mfi(4) code has seen some changes since 8.1.
>> > Between 8.1 and 8.2 mfi has not had any significant changes.  The
>> only changes
>> > made to sys/dev/mfi were to add a new constant:
>> >
>> >> svn diff svn+ssh://svn.freebsd.org/base/releng/8.1/sys/dev/mfi
>> > svn+ssh://svn.freebsd.org/base/releng/8.2/sys/dev/mfi
>> > Index: mfireg.h
>> > ===================================================================
>> > --- mfireg.h    (.../8.1/sys/dev/mfi)   (revision 237134)
>> > +++ mfireg.h    (.../8.2/sys/dev/mfi)   (revision 237134)
>> > @@ -975,7 +975,9 @@
>> >          MFI_PD_STATE_OFFLINE = 0x10,
>> >          MFI_PD_STATE_FAILED = 0x11,
>> >          MFI_PD_STATE_REBUILD = 0x14,
>> > -       MFI_PD_STATE_ONLINE = 0x18
>> > +       MFI_PD_STATE_ONLINE = 0x18,
>> > +       MFI_PD_STATE_COPYBACK = 0x20,
>> > +       MFI_PD_STATE_SYSTEM = 0x40
>> >   };
>> >
>> >   union mfi_ld_ref {
>> >
>> > The difference in write performance must be due to something else.  You
>> > mentioned you are using UFS + gjournal.  I think gjournal uses
>> BIO_FLUSH, so I
>> > wonder if this is related:
>> >
>> >
>> ------------------------------------------------------------------------
>> > r212939 | gibbs | 2010-09-20 19:39:00 -0400 (Mon, 20 Sep 2010) | 61
>> lines
>> >
>> > MFC 212160:
>> >
>> > Correct bioq_disksort so that bioq_insert_tail() offers barrier
>> semantic.
>> > Add the BIO_ORDERED flag for struct bio and update bio clients to
>> use it.
>> >
>> > The barrier semantics of bioq_insert_tail() were broken in two ways:
>> >
>> >   o In bioq_disksort(), an added bio could be inserted at the head of
>> >     the queue, even when a barrier was present, if the sort key for
>> >     the new entry was less than that of the last queued barrier bio.
>> >
>> >   o The last_offset used to generate the sort key for newly queued bios
>> >     did not stay at the position of the barrier until either the
>> >     barrier was de-queued, or a new barrier (which updates last_offset)
>> >     was queued.  When a barrier is in effect, we know that the disk
>> >     will pass through the barrier position just before the
>> >     "blocked bios" are released, so using the barrier's offset for
>> >     last_offset is the optimal choice.
>> >
>> > sys/geom/sched/subr_disk.c:
>> > sys/kern/subr_disk.c:
>> >          o Update last_offset in bioq_insert_tail().
>> >
>> >          o Only update last_offset in bioq_remove() if the removed
>> bio is
>> >            at the head of the queue (typically due to a call via
>> >            bioq_takefirst()) and no barrier is active.
>> >
>> >          o In bioq_disksort(), if we have a barrier (insert_point is
>> non-NULL),
>> >            set prev to the barrier and cur to it's next element.
>> Now that
>> >            last_offset is kept at the barrier position, this change
>> isn't
>> >            strictly necessary, but since we have to take a decision
>> branch
>> >            anyway, it does avoid one, no-op, loop iteration in the
>> while
>> >            loop that immediately follows.
>> >
>> >          o In bioq_disksort(), bypass the normal sort for bios with the
>> >            BIO_ORDERED attribute and instead insert them into the queue
>> >            with bioq_insert_tail().  bioq_insert_tail() not only gives
>> >            the desired command order during insertion, but also
>> provides
>> >            barrier semantics so that commands disksorted in the future
>> >            cannot pass the just enqueued transaction.
>> >
>> > sys/sys/bio.h:
>> >          Add BIO_ORDERED as bit 4 of the bio_flags field in struct bio.
>> >
>> > sys/cam/ata/ata_da.c:
>> > sys/cam/scsi/scsi_da.c
>> >          Use an ordered command for SCSI/ATA-NCQ commands issued in
>> >          response to bios with the BIO_ORDERED flag set.
>> >
>> > sys/cam/scsi/scsi_da.c
>> >          Use an ordered tag when issuing a synchronize cache command.
>> >
>> >          Wrap some lines to 80 columns.
>> >
>> > sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
>> > sys/geom/geom_io.c
>> >          Mark bios with the BIO_FLUSH command as BIO_ORDERED.
>> >
>> > Sponsored by:   Spectra Logic Corporation
>> >
>> ------------------------------------------------------------------------
>> >
>> > Can you try perhaps commenting out the 'bp->bio_flags |=
>> BIO_ORDERED' line
>> > changed in geom_io.c in 8.2?  That would be effectively reverting this
>> > portion of the diff:
>> >
>> > Index: geom_io.c
>> > ===================================================================
>> > --- geom_io.c   (.../8.1/sys/geom)      (revision 237134)
>> > +++ geom_io.c   (.../8.2/sys/geom)      (revision 237134)
>> > @@ -265,6 +265,7 @@
>> >          g_trace(G_T_BIO, "bio_flush(%s)", cp->provider->name);
>> >          bp = g_alloc_bio();
>> >          bp->bio_cmd = BIO_FLUSH;
>> > +       bp->bio_flags |= BIO_ORDERED;
>> >          bp->bio_done = NULL;
>> >          bp->bio_attribute = NULL;
>> >          bp->bio_offset = cp->provider->mediasize;
>> >
>>
>> John... thanks for the suggestion.  I've built and tested a kernel with
>> this change made.  Result:  no change (same performance as with
>> 8.2-GENERIC). Any thoughts as to where to go next?
>
> Hmm.  That seemed the most plausible candidate when I looked at this.
>
> Do you use quotas (there is one change in UFS related to quotas)?
>
> There are 5 changes that involve sys/kern/vfs_bio.c in 8.2:
> 209459, 212229, 212562, 212583, and 213890.
>
> Can you possibly test out kernels from stable/8 at those revisions on an
> 8.1
> world and see if you can narrow it down futher?
>
> Barring that, can you do a binary search of kernels from stable/8
> between 8.1
> and 8.2 on an 8.1 world to see which commit caused the change in write
> performance?
>

Hi John, I'm working with Charles to narrow this down.

Looks like revision 212229 is the culprit, or at least around the same 
time to it, if this change isn't what slowed things down. The change to 
sys/kern/vfs_bio.c modifies some synchronization in dev_strategy():


--- stable/8/sys/kern/vfs_bio.c	2010/06/23 10:06:57	209459
+++ stable/8/sys/kern/vfs_bio.c	2010/09/05 14:27:55	212229
@@ -3195,6 +3195,7 @@
  {
  	struct cdevsw *csw;
  	struct bio *bip;
+	int ref;

  	if ((!bp->b_iocmd) || (bp->b_iocmd & (bp->b_iocmd - 1)))
  		panic("b_iocmd botch");
@@ -3216,7 +3217,7 @@
  	KASSERT(dev->si_refcount > 0,
  	    ("dev_strategy on un-referenced struct cdev *(%s)",
  	    devtoname(dev)));
-	csw = dev_refthread(dev);
+	csw = dev_refthread(dev, &ref);
  	if (csw == NULL) {
  		g_destroy_bio(bip);
  		bp->b_error = ENXIO;
@@ -3225,7 +3226,7 @@
  		return;
  	}
  	(*csw->d_strategy)(bip);
-	dev_relthread(dev);
+	dev_relthread(dev, ref);
  }

  /*


This is in line with the following changes to dev_refthread and 
dev_relthread, but it's not obvious to me how these alone would impact 
performance. Do you have any thoughts?


--- stable/8/sys/kern/kern_conf.c	2010/07/03 18:19:59	209668
+++ stable/8/sys/kern/kern_conf.c	2010/09/05 14:27:55	212229
@@ -177,12 +177,16 @@
  }

  struct cdevsw *
-dev_refthread(struct cdev *dev)
+dev_refthread(struct cdev *dev, int *ref)
  {
  	struct cdevsw *csw;
  	struct cdev_priv *cdp;

  	mtx_assert(&devmtx, MA_NOTOWNED);
+	if ((dev->si_flags & SI_ETERNAL) != 0) {
+		*ref = 0;
+		return (dev->si_devsw);
+	}
  	dev_lock();
  	csw = dev->si_devsw;
  	if (csw != NULL) {
@@ -193,36 +197,59 @@
  			csw = NULL;
  	}
  	dev_unlock();
+	*ref = 1;
  	return (csw);
  }

…

void	
-dev_relthread(struct cdev *dev)
+dev_relthread(struct cdev *dev, int ref)
  {

  	mtx_assert(&devmtx, MA_NOTOWNED);
+	if (!ref)
+		return;
  	dev_lock();
  	KASSERT(dev->si_threadcount > 0,
  	    ("%s threadcount is wrong", dev->si_name));




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?4FFF34BA.9030002>