From owner-freebsd-scsi@FreeBSD.ORG Sat Jan 14 23:46:23 2012
Date: Sat, 14 Jan 2012 16:22:45 -0700
From: "Kenneth D. Merry" <ken@kdm.org>
To: John
Cc: freebsd-scsi@freebsd.org
Subject: Re: mps driver chain_alloc_fail / performance ?
Message-ID: <20120114232245.GA57880@nargothrond.kdm.org>
In-Reply-To: <20120114051618.GA41288@FreeBSD.org>

On Sat, Jan 14, 2012 at 05:16:18 +0000, John wrote:
> Hi Folks,
> 
> I've started poking through the source for this, but thought I'd
> go ahead and post to ask others for their opinion.
> 
> I have a system with 3 LSI SAS HBA cards installed:
> 
> mps0: port 0x5000-0x50ff mem 0xf5ff0000-0xf5ff3fff,0xf5f80000-0xf5fbffff irq 30 at device 0.0 on pci13
> mps0: Firmware: 05.00.13.00
> mps0: IOCCapabilities: 285c
> mps1: port 0x7000-0x70ff mem 0xfbef0000-0xfbef3fff,0xfbe80000-0xfbebffff irq 48 at device 0.0 on pci33
> mps1: Firmware: 07.00.00.00
> mps1: IOCCapabilities: 1285c
> mps2: port 0x6000-0x60ff mem 0xfbcf0000-0xfbcf3fff,0xfbc80000-0xfbcbffff irq 56 at device 0.0 on pci27
> mps2: Firmware: 07.00.00.00
> mps2: IOCCapabilities: 1285c

The firmware on those boards is a little old.  You might consider
upgrading.

> Basically, one for internal and two for external drives, for a total
> of about 200 drives, i.e.:
> 
> # camcontrol inquiry da10
> pass21: Fixed Direct Access SCSI-5 device
> pass21: Serial Number 6XR14KYV0000B148LDKM
> pass21: 600.000MB/s transfers, Command Queueing Enabled

That's a lot of drives!  I've only run up to 60 drives.

> When running the system under load, I see the following reported:
> 
> hw.mps.0.allow_multiple_tm_cmds: 0
> hw.mps.0.io_cmds_active: 0
> hw.mps.0.io_cmds_highwater: 772
> hw.mps.0.chain_free: 2048
> hw.mps.0.chain_free_lowwater: 1832
> hw.mps.0.chain_alloc_fail: 0      <--- Ok
> 
> hw.mps.1.allow_multiple_tm_cmds: 0
> hw.mps.1.io_cmds_active: 0
> hw.mps.1.io_cmds_highwater: 1019
> hw.mps.1.chain_free: 2048
> hw.mps.1.chain_free_lowwater: 0
> hw.mps.1.chain_alloc_fail: 14369  <---- ??
> 
> hw.mps.2.allow_multiple_tm_cmds: 0
> hw.mps.2.io_cmds_active: 0
> hw.mps.2.io_cmds_highwater: 1019
> hw.mps.2.chain_free: 2048
> hw.mps.2.chain_free_lowwater: 0
> hw.mps.2.chain_alloc_fail: 13307  <---- ??
> 
> So finally my question (sorry, I'm long winded): what is the
> correct way to increase the number of elements in sc->chain_list
> so mps_alloc_chain() won't run out?

Bump MPS_CHAIN_FRAMES to something larger.  You can try 4096 and see
what happens.
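
If it helps, the edit itself is a one-liner.  This is only a sketch and
assumes the define still lives in sys/dev/mps/mpsvar.h and is still the
stock 2048 in your tree (which matches the chain_free values above);
rebuild the kernel or the mps module afterwards:

	/* sys/dev/mps/mpsvar.h -- grow the chain frame pool */
	#define MPS_CHAIN_FRAMES	4096	/* was 2048 */

That doubles the pool that the hw.mps.X.chain_free sysctls are
reporting.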
> static __inline struct mps_chain *
> mps_alloc_chain(struct mps_softc *sc)
> {
> 	struct mps_chain *chain;
> 
> 	if ((chain = TAILQ_FIRST(&sc->chain_list)) != NULL) {
> 		TAILQ_REMOVE(&sc->chain_list, chain, chain_link);
> 		sc->chain_free--;
> 		if (sc->chain_free < sc->chain_free_lowwater)
> 			sc->chain_free_lowwater = sc->chain_free;
> 	} else
> 		sc->chain_alloc_fail++;
> 	return (chain);
> }
> 
> A few layers up, it seems like it would be nice if the buffer
> exhaustion was reported outside of debug being enabled... at least
> maybe the first time.

It used to report being out of chain frames every time it happened,
which wound up being too much.  You're right, doing it once might be
good.

> It looks like changing the related #define is the only way.

Yes, that is currently the only way.  Yours is by far the largest setup
I've seen so far; I've run the driver with 60 drives attached.

> Does anyone have any experience with tuning this driver for high
> throughput/large disk arrays?  The shelves are all dual-pathed, and
> with the new gmultipath active/active support, I've still only been
> able to achieve about 500 MBytes per second across the
> controllers/drives.

Once you bump up the number of chain frames to the point where you
aren't running out, I doubt the driver will be the big bottleneck.
It'll probably be other things higher up the stack.

> ps: I currently have a ccd on top of these drives, which seems to
> perform more consistently than ZFS.  But that's an email for a
> different day :-)

What sort of ZFS topology did you try?  I know for raidz2, and perhaps
for raidz, ZFS is faster if your number of data disks is a power of 2.
If you want raidz2 protection, try creating arrays in groups of 10, so
you wind up having 8 data disks.

Ken
-- 
Kenneth Merry
ken@FreeBSD.ORG
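
P.S. Here is a rough, untested sketch of the "report it once" idea
against the mps_alloc_chain() you quoted.  The device_printf() call and
the sc->mps_dev field it uses are assumptions about the surrounding
driver code, so treat it as an illustration rather than a drop-in
patch:

	static __inline struct mps_chain *
	mps_alloc_chain(struct mps_softc *sc)
	{
		struct mps_chain *chain;

		if ((chain = TAILQ_FIRST(&sc->chain_list)) != NULL) {
			TAILQ_REMOVE(&sc->chain_list, chain, chain_link);
			sc->chain_free--;
			if (sc->chain_free < sc->chain_free_lowwater)
				sc->chain_free_lowwater = sc->chain_free;
		} else {
			/*
			 * Warn the first time we run out of chain frames;
			 * the running total stays visible through the
			 * chain_alloc_fail sysctl, so there is no need to
			 * log every failure.
			 */
			if (sc->chain_alloc_fail == 0)
				device_printf(sc->mps_dev,
				    "out of chain frames, consider "
				    "increasing MPS_CHAIN_FRAMES\n");
			sc->chain_alloc_fail++;
		}
		return (chain);
	}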