Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 13 Jul 2012 09:39:54 -0400
From:      John Baldwin <jhb@freebsd.org>
To:        freebsd-stable@freebsd.org
Cc:        Charles Owens <cowens@greatbaysoftware.com>, Steve McCoy <smccoy@greatbaysoftware.com>
Subject:   Re: mfi(4) IO performance regression, post 8.1
Message-ID:  <201207130939.54311.jhb@freebsd.org>
In-Reply-To: <4FFF9A50.40006@greatbaysoftware.com>
References:  <4FDABA0B.5030702@greatbaysoftware.com> <4FFF34BA.9030002@greatbaysoftware.com> <4FFF9A50.40006@greatbaysoftware.com>

next in thread | previous in thread | raw e-mail | index | archive | help
On Thursday, July 12, 2012 11:47:28 pm Steve McCoy wrote:
> On 7/12/12 4:34 PM, Steve McCoy wrote:
> > On 7/12/12 4:14 PM, Charles Owens wrote:
> >> On Thursday, June 21, 2012 10:36:04 pm Charles Owens wrote:
> >>>
> >>> On 6/15/12 8:04 AM, John Baldwin wrote:
> >>> > On Friday, June 15, 2012 12:28:59 am Charles Owens wrote:
> >>> >> Hello FreeBSD folk,
> >>> >>
> >>> >> We're seeing what appears to be a storage performance regression
> >>> as we
> >>> >> try to move from 8.1 (i386) to 8.3.   We looked at 8.2 also and it
> >>> >> appears that the regression happened between 8.1 and 8.2.
> >>> >>
> >>> >> Our system is an Intel S5520UR Server with 12 GB RAM, dual 4-core
> >>> CPUs.
> >>> >> Storage is a LSI MegaSAS 1078 controller (mfi) in a RAID-10
> >>> >> configuration, using UFS + geom_journal for filesystem.
> >>> >>
> >>> >> Postgresql performance, as seen via pgbench, dropped by approx 20%.
> >>> >> This testing was done with our usual PAE-enabled kernels.  We then
> >>> went
> >>> >> back to GENERIC kernels and did comparisons using "bonnie", results
> >>> >> below.  Following that is a kernel boot log.
> >>> >>
> >>> >> Notably, we're seeing this regression only with our RAID mfi(4) ba=
sed
> >>> >> systems.  Notably, from looking at FreeBSD source changelogs it
> >>> appears
> >>> >> that the mfi(4) code has seen some changes since 8.1.
> >>> > Between 8.1 and 8.2 mfi has not had any significant changes.  The
> >>> only changes
> >>> > made to sys/dev/mfi were to add a new constant:
> >>> >
> >>> >> svn diff svn+ssh://svn.freebsd.org/base/releng/8.1/sys/dev/mfi
> >>> > svn+ssh://svn.freebsd.org/base/releng/8.2/sys/dev/mfi
> >>> > Index: mfireg.h
> >>> > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> >>> > --- mfireg.h    (.../8.1/sys/dev/mfi)   (revision 237134)
> >>> > +++ mfireg.h    (.../8.2/sys/dev/mfi)   (revision 237134)
> >>> > @@ -975,7 +975,9 @@
> >>> >          MFI_PD_STATE_OFFLINE =3D 0x10,
> >>> >          MFI_PD_STATE_FAILED =3D 0x11,
> >>> >          MFI_PD_STATE_REBUILD =3D 0x14,
> >>> > -       MFI_PD_STATE_ONLINE =3D 0x18
> >>> > +       MFI_PD_STATE_ONLINE =3D 0x18,
> >>> > +       MFI_PD_STATE_COPYBACK =3D 0x20,
> >>> > +       MFI_PD_STATE_SYSTEM =3D 0x40
> >>> >   };
> >>> >
> >>> >   union mfi_ld_ref {
> >>> >
> >>> > The difference in write performance must be due to something else.
> >>> You
> >>> > mentioned you are using UFS + gjournal.  I think gjournal uses
> >>> BIO_FLUSH, so I
> >>> > wonder if this is related:
> >>> >
> >>> >
> >>> ---------------------------------------------------------------------=
=2D--
> >>> > r212939 | gibbs | 2010-09-20 19:39:00 -0400 (Mon, 20 Sep 2010) | 61
> >>> lines
> >>> >
> >>> > MFC 212160:
> >>> >
> >>> > Correct bioq_disksort so that bioq_insert_tail() offers barrier
> >>> semantic.
> >>> > Add the BIO_ORDERED flag for struct bio and update bio clients to
> >>> use it.
> >>> >
> >>> > The barrier semantics of bioq_insert_tail() were broken in two ways:
> >>> >
> >>> >   o In bioq_disksort(), an added bio could be inserted at the head =
of
> >>> >     the queue, even when a barrier was present, if the sort key for
> >>> >     the new entry was less than that of the last queued barrier bio.
> >>> >
> >>> >   o The last_offset used to generate the sort key for newly queued
> >>> bios
> >>> >     did not stay at the position of the barrier until either the
> >>> >     barrier was de-queued, or a new barrier (which updates
> >>> last_offset)
> >>> >     was queued.  When a barrier is in effect, we know that the disk
> >>> >     will pass through the barrier position just before the
> >>> >     "blocked bios" are released, so using the barrier's offset for
> >>> >     last_offset is the optimal choice.
> >>> >
> >>> > sys/geom/sched/subr_disk.c:
> >>> > sys/kern/subr_disk.c:
> >>> >          o Update last_offset in bioq_insert_tail().
> >>> >
> >>> >          o Only update last_offset in bioq_remove() if the removed
> >>> bio is
> >>> >            at the head of the queue (typically due to a call via
> >>> >            bioq_takefirst()) and no barrier is active.
> >>> >
> >>> >          o In bioq_disksort(), if we have a barrier (insert_point is
> >>> non-NULL),
> >>> >            set prev to the barrier and cur to it's next element.
> >>> Now that
> >>> >            last_offset is kept at the barrier position, this change
> >>> isn't
> >>> >            strictly necessary, but since we have to take a decision
> >>> branch
> >>> >            anyway, it does avoid one, no-op, loop iteration in the
> >>> while
> >>> >            loop that immediately follows.
> >>> >
> >>> >          o In bioq_disksort(), bypass the normal sort for bios with
> >>> the
> >>> >            BIO_ORDERED attribute and instead insert them into the
> >>> queue
> >>> >            with bioq_insert_tail().  bioq_insert_tail() not only gi=
ves
> >>> >            the desired command order during insertion, but also
> >>> provides
> >>> >            barrier semantics so that commands disksorted in the fut=
ure
> >>> >            cannot pass the just enqueued transaction.
> >>> >
> >>> > sys/sys/bio.h:
> >>> >          Add BIO_ORDERED as bit 4 of the bio_flags field in struct
> >>> bio.
> >>> >
> >>> > sys/cam/ata/ata_da.c:
> >>> > sys/cam/scsi/scsi_da.c
> >>> >          Use an ordered command for SCSI/ATA-NCQ commands issued in
> >>> >          response to bios with the BIO_ORDERED flag set.
> >>> >
> >>> > sys/cam/scsi/scsi_da.c
> >>> >          Use an ordered tag when issuing a synchronize cache comman=
d.
> >>> >
> >>> >          Wrap some lines to 80 columns.
> >>> >
> >>> > sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
> >>> > sys/geom/geom_io.c
> >>> >          Mark bios with the BIO_FLUSH command as BIO_ORDERED.
> >>> >
> >>> > Sponsored by:   Spectra Logic Corporation
> >>> >
> >>> ---------------------------------------------------------------------=
=2D--
> >>> >
> >>> > Can you try perhaps commenting out the 'bp->bio_flags |=3D
> >>> BIO_ORDERED' line
> >>> > changed in geom_io.c in 8.2?  That would be effectively reverting t=
his
> >>> > portion of the diff:
> >>> >
> >>> > Index: geom_io.c
> >>> > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> >>> > --- geom_io.c   (.../8.1/sys/geom)      (revision 237134)
> >>> > +++ geom_io.c   (.../8.2/sys/geom)      (revision 237134)
> >>> > @@ -265,6 +265,7 @@
> >>> >          g_trace(G_T_BIO, "bio_flush(%s)", cp->provider->name);
> >>> >          bp =3D g_alloc_bio();
> >>> >          bp->bio_cmd =3D BIO_FLUSH;
> >>> > +       bp->bio_flags |=3D BIO_ORDERED;
> >>> >          bp->bio_done =3D NULL;
> >>> >          bp->bio_attribute =3D NULL;
> >>> >          bp->bio_offset =3D cp->provider->mediasize;
> >>> >
> >>>
> >>> John... thanks for the suggestion.  I've built and tested a kernel wi=
th
> >>> this change made.  Result:  no change (same performance as with
> >>> 8.2-GENERIC). Any thoughts as to where to go next?
> >>
> >> Hmm.  That seemed the most plausible candidate when I looked at this.
> >>
> >> Do you use quotas (there is one change in UFS related to quotas)?
> >>
> >> There are 5 changes that involve sys/kern/vfs_bio.c in 8.2:
> >> 209459, 212229, 212562, 212583, and 213890.
> >>
> >> Can you possibly test out kernels from stable/8 at those revisions on =
an
> >> 8.1
> >> world and see if you can narrow it down futher?
> >>
> >> Barring that, can you do a binary search of kernels from stable/8
> >> between 8.1
> >> and 8.2 on an 8.1 world to see which commit caused the change in write
> >> performance?
> >>
> >
> > Hi John, I'm working with Charles to narrow this down.
> >
> > Looks like revision 212229 is the culprit, or at least around the same
> > time to it, if this change isn't what slowed things down. The change to
> > sys/kern/vfs_bio.c modifies some synchronization in dev_strategy():
> >
>=20
> Actually, hold that thought. I had a hunch that I wasn't thorough=20
> enough, so I decided to try 212228 =E2=80=94 the performance is the same =
as with=20
> 212229, so vfs_bio seems to be out of the picture. I'm going to binary=20
> search between 209459 and 212229, and see what I find.

Ok.  Please let me know what you find.  Thanks!

=2D-=20
John Baldwin



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?201207130939.54311.jhb>