From owner-freebsd-stable@FreeBSD.ORG Thu Jul 12 20:34:19 2012 Return-Path: Delivered-To: freebsd-stable@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id D9405106564A; Thu, 12 Jul 2012 20:34:19 +0000 (UTC) (envelope-from smccoy@greatbaysoftware.com) Received: from ecbiz102.inmotionhosting.com (ecbiz102.inmotionhosting.com [70.39.235.94]) by mx1.freebsd.org (Postfix) with ESMTP id 944108FC08; Thu, 12 Jul 2012 20:34:19 +0000 (UTC) Received: from c-71-233-85-132.hsd1.nh.comcast.net ([71.233.85.132]:50618 helo=smccoy-mbp.local) by ecbiz102.inmotionhosting.com with esmtpsa (TLSv1:AES256-SHA:256) (Exim 4.69) (envelope-from ) id 1SpQ5L-0001Rq-Gs; Thu, 12 Jul 2012 16:34:09 -0400 Message-ID: <4FFF34BA.9030002@greatbaysoftware.com> Date: Thu, 12 Jul 2012 16:34:02 -0400 From: Steve McCoy User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:13.0) Gecko/20120614 Thunderbird/13.0.1 MIME-Version: 1.0 To: jhb@freebsd.org References: <4FDABA0B.5030702@greatbaysoftware.com> <201206150804.46341.jhb@freebsd.org> <4FE3DA14.9090506@greatbaysoftware.com> <4FFF301E.30603@greatbaysoftware.com> In-Reply-To: <4FFF301E.30603@greatbaysoftware.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-AntiAbuse: This header was added to track abuse, please include it with any abuse report X-AntiAbuse: Primary Hostname - ecbiz102.inmotionhosting.com X-AntiAbuse: Original Domain - freebsd.org X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12] X-AntiAbuse: Sender Address Domain - greatbaysoftware.com Cc: Charles Owens , freebsd-stable@freebsd.org Subject: Re: mfi(4) IO performance regression, post 8.1 X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 12 Jul 2012 20:34:19 -0000 On 7/12/12 4:14 PM, Charles Owens wrote: > On Thursday, June 21, 2012 10:36:04 pm Charles Owens wrote: >> >> On 6/15/12 8:04 AM, John Baldwin wrote: >> > On Friday, June 15, 2012 12:28:59 am Charles Owens wrote: >> >> Hello FreeBSD folk, >> >> >> >> We're seeing what appears to be a storage performance regression as we >> >> try to move from 8.1 (i386) to 8.3. We looked at 8.2 also and it >> >> appears that the regression happened between 8.1 and 8.2. >> >> >> >> Our system is an Intel S5520UR Server with 12 GB RAM, dual 4-core >> CPUs. >> >> Storage is a LSI MegaSAS 1078 controller (mfi) in a RAID-10 >> >> configuration, using UFS + geom_journal for filesystem. >> >> >> >> Postgresql performance, as seen via pgbench, dropped by approx 20%. >> >> This testing was done with our usual PAE-enabled kernels. We then >> went >> >> back to GENERIC kernels and did comparisons using "bonnie", results >> >> below. Following that is a kernel boot log. >> >> >> >> Notably, we're seeing this regression only with our RAID mfi(4) based >> >> systems. Notably, from looking at FreeBSD source changelogs it >> appears >> >> that the mfi(4) code has seen some changes since 8.1. >> > Between 8.1 and 8.2 mfi has not had any significant changes. The >> only changes >> > made to sys/dev/mfi were to add a new constant: >> > >> >> svn diff svn+ssh://svn.freebsd.org/base/releng/8.1/sys/dev/mfi >> > svn+ssh://svn.freebsd.org/base/releng/8.2/sys/dev/mfi >> > Index: mfireg.h >> > =================================================================== >> > --- mfireg.h (.../8.1/sys/dev/mfi) (revision 237134) >> > +++ mfireg.h (.../8.2/sys/dev/mfi) (revision 237134) >> > @@ -975,7 +975,9 @@ >> > MFI_PD_STATE_OFFLINE = 0x10, >> > MFI_PD_STATE_FAILED = 0x11, >> > MFI_PD_STATE_REBUILD = 0x14, >> > - MFI_PD_STATE_ONLINE = 0x18 >> > + MFI_PD_STATE_ONLINE = 0x18, >> > + MFI_PD_STATE_COPYBACK = 0x20, >> > + MFI_PD_STATE_SYSTEM = 0x40 >> > }; >> > >> > union mfi_ld_ref { >> > >> > The difference in write performance must be due to something else. You >> > mentioned you are using UFS + gjournal. I think gjournal uses >> BIO_FLUSH, so I >> > wonder if this is related: >> > >> > >> ------------------------------------------------------------------------ >> > r212939 | gibbs | 2010-09-20 19:39:00 -0400 (Mon, 20 Sep 2010) | 61 >> lines >> > >> > MFC 212160: >> > >> > Correct bioq_disksort so that bioq_insert_tail() offers barrier >> semantic. >> > Add the BIO_ORDERED flag for struct bio and update bio clients to >> use it. >> > >> > The barrier semantics of bioq_insert_tail() were broken in two ways: >> > >> > o In bioq_disksort(), an added bio could be inserted at the head of >> > the queue, even when a barrier was present, if the sort key for >> > the new entry was less than that of the last queued barrier bio. >> > >> > o The last_offset used to generate the sort key for newly queued bios >> > did not stay at the position of the barrier until either the >> > barrier was de-queued, or a new barrier (which updates last_offset) >> > was queued. When a barrier is in effect, we know that the disk >> > will pass through the barrier position just before the >> > "blocked bios" are released, so using the barrier's offset for >> > last_offset is the optimal choice. >> > >> > sys/geom/sched/subr_disk.c: >> > sys/kern/subr_disk.c: >> > o Update last_offset in bioq_insert_tail(). >> > >> > o Only update last_offset in bioq_remove() if the removed >> bio is >> > at the head of the queue (typically due to a call via >> > bioq_takefirst()) and no barrier is active. >> > >> > o In bioq_disksort(), if we have a barrier (insert_point is >> non-NULL), >> > set prev to the barrier and cur to it's next element. >> Now that >> > last_offset is kept at the barrier position, this change >> isn't >> > strictly necessary, but since we have to take a decision >> branch >> > anyway, it does avoid one, no-op, loop iteration in the >> while >> > loop that immediately follows. >> > >> > o In bioq_disksort(), bypass the normal sort for bios with the >> > BIO_ORDERED attribute and instead insert them into the queue >> > with bioq_insert_tail(). bioq_insert_tail() not only gives >> > the desired command order during insertion, but also >> provides >> > barrier semantics so that commands disksorted in the future >> > cannot pass the just enqueued transaction. >> > >> > sys/sys/bio.h: >> > Add BIO_ORDERED as bit 4 of the bio_flags field in struct bio. >> > >> > sys/cam/ata/ata_da.c: >> > sys/cam/scsi/scsi_da.c >> > Use an ordered command for SCSI/ATA-NCQ commands issued in >> > response to bios with the BIO_ORDERED flag set. >> > >> > sys/cam/scsi/scsi_da.c >> > Use an ordered tag when issuing a synchronize cache command. >> > >> > Wrap some lines to 80 columns. >> > >> > sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c >> > sys/geom/geom_io.c >> > Mark bios with the BIO_FLUSH command as BIO_ORDERED. >> > >> > Sponsored by: Spectra Logic Corporation >> > >> ------------------------------------------------------------------------ >> > >> > Can you try perhaps commenting out the 'bp->bio_flags |= >> BIO_ORDERED' line >> > changed in geom_io.c in 8.2? That would be effectively reverting this >> > portion of the diff: >> > >> > Index: geom_io.c >> > =================================================================== >> > --- geom_io.c (.../8.1/sys/geom) (revision 237134) >> > +++ geom_io.c (.../8.2/sys/geom) (revision 237134) >> > @@ -265,6 +265,7 @@ >> > g_trace(G_T_BIO, "bio_flush(%s)", cp->provider->name); >> > bp = g_alloc_bio(); >> > bp->bio_cmd = BIO_FLUSH; >> > + bp->bio_flags |= BIO_ORDERED; >> > bp->bio_done = NULL; >> > bp->bio_attribute = NULL; >> > bp->bio_offset = cp->provider->mediasize; >> > >> >> John... thanks for the suggestion. I've built and tested a kernel with >> this change made. Result: no change (same performance as with >> 8.2-GENERIC). Any thoughts as to where to go next? > > Hmm. That seemed the most plausible candidate when I looked at this. > > Do you use quotas (there is one change in UFS related to quotas)? > > There are 5 changes that involve sys/kern/vfs_bio.c in 8.2: > 209459, 212229, 212562, 212583, and 213890. > > Can you possibly test out kernels from stable/8 at those revisions on an > 8.1 > world and see if you can narrow it down futher? > > Barring that, can you do a binary search of kernels from stable/8 > between 8.1 > and 8.2 on an 8.1 world to see which commit caused the change in write > performance? > Hi John, I'm working with Charles to narrow this down. Looks like revision 212229 is the culprit, or at least around the same time to it, if this change isn't what slowed things down. The change to sys/kern/vfs_bio.c modifies some synchronization in dev_strategy(): --- stable/8/sys/kern/vfs_bio.c 2010/06/23 10:06:57 209459 +++ stable/8/sys/kern/vfs_bio.c 2010/09/05 14:27:55 212229 @@ -3195,6 +3195,7 @@ { struct cdevsw *csw; struct bio *bip; + int ref; if ((!bp->b_iocmd) || (bp->b_iocmd & (bp->b_iocmd - 1))) panic("b_iocmd botch"); @@ -3216,7 +3217,7 @@ KASSERT(dev->si_refcount > 0, ("dev_strategy on un-referenced struct cdev *(%s)", devtoname(dev))); - csw = dev_refthread(dev); + csw = dev_refthread(dev, &ref); if (csw == NULL) { g_destroy_bio(bip); bp->b_error = ENXIO; @@ -3225,7 +3226,7 @@ return; } (*csw->d_strategy)(bip); - dev_relthread(dev); + dev_relthread(dev, ref); } /* This is in line with the following changes to dev_refthread and dev_relthread, but it's not obvious to me how these alone would impact performance. Do you have any thoughts? --- stable/8/sys/kern/kern_conf.c 2010/07/03 18:19:59 209668 +++ stable/8/sys/kern/kern_conf.c 2010/09/05 14:27:55 212229 @@ -177,12 +177,16 @@ } struct cdevsw * -dev_refthread(struct cdev *dev) +dev_refthread(struct cdev *dev, int *ref) { struct cdevsw *csw; struct cdev_priv *cdp; mtx_assert(&devmtx, MA_NOTOWNED); + if ((dev->si_flags & SI_ETERNAL) != 0) { + *ref = 0; + return (dev->si_devsw); + } dev_lock(); csw = dev->si_devsw; if (csw != NULL) { @@ -193,36 +197,59 @@ csw = NULL; } dev_unlock(); + *ref = 1; return (csw); } … void -dev_relthread(struct cdev *dev) +dev_relthread(struct cdev *dev, int ref) { mtx_assert(&devmtx, MA_NOTOWNED); + if (!ref) + return; dev_lock(); KASSERT(dev->si_threadcount > 0, ("%s threadcount is wrong", dev->si_name));